ABSTRACT


WORD STATISTICS IN THE GENERATION OF
SEMANTIC TOOLS FOR INFORMATION SYSTEMS

A crucial problem in systems for the storage and retrieval of technical
information is the interpretation of words used to index documents. Semantic
tools, defined as channels for the communication of word meanings between
technical experts, document indexers, and searchers, provide one method of
dealing with the problem of multiple interpretations. This report shows
how statistical data on the distribution of occurrences of single words or
word pairs in the text of a set of documents can be used in generating
semantic tools, in particular, an indexing vocabulary and relations
among the terms in this vocabulary. An experiment in this area is described,
involving the testing of several new statistical measures and techniques.

The results give some insight into the patterns of language usage in
technical literature and suggest directions for future research.
1. SEMANTIC TOOLS
1.1 Introduction

In many modern information systems the important topics of
each document in the system are represented by a list of index terms
assigned to that document. The index terms are generally natural
language words or word phrases, and people desiring information from
documents in the system will formulate their information need in terms
of the indexing vocabulary. The system, in response to an information
request, will present the searcher with a set of documents, each of
which is associated with a set of index terms satisfying the searcher's
specification. (Alternatively, the system could supply a list of
citations and perhaps abstracts for the documents which fulfill the
searcher's specification.) Most information systems, including
libraries, also have provision for access to documents through other
attributes, such as author or title, but this approach will not be
treated here. It is clear that in the frequent cases where index
terms are the main intermediary between the documents in a system and
the users of the system, the usefulness of the system as a whole is
strongly dependent on the index terms.

There are two basic problems in using isolated words or word
phrases to denote topics or concepts (see the discussion by Phyllis
Reisner (1965)). One is that many concepts can be designated in a
number of different ways by phrases which are synonyms or near
synonyms of each other; hence there is a lot of redundancy in the
language. The other is that many words are ambiguous because they are
homographs (i.e., they have multiple unrelated meanings). Even words
which are not homographs can have several interpretations depending
on context, and almost all words have a hazy boundary region where
their applicability is not agreed upon.

Each expert understands and uses the vocabulary of his field
in a slightly different way, and indexers are confronted with the
double problem of deciding how an author uses the technical vocabulary
in a document and choosing index terms for that document which are
consistent with the indexing of other documents. Similarly, the user's
interpretation of the words in which he first phrases a request for
information in some area depends on the documents he has previously
read in that area, his background, his area of specialization, and his
immediate information requirement. The initial request, therefore,
may employ quite a different vocabulary from that of the indexers.

As a consequence, in order for an information system to serve
its purpose of connecting a user with information he needs, it must
channel different interpretations of words into a more nearly uniform
interpretation. We have defined a semantic tool as any device which
accomplishes this by giving the indexer or searcher information about
how words are used in the system. Semantic tools are actually a medium
of communication among subject area experts (who will aid in generating
them), indexers, and users. (See Figure 1-1.)

The rest of this section of the report will treat semantic
tools in some detail. The second section of the report will discuss
an approach to generating semantic tools which utilizes statistical
data derived from document texts. Section 3 will describe an
experiment in this area carried out by the author, and Section 4 will be a
discussion of the results of this experiment. The final section of
the report will contain proposals for further research in this area.
Figure 1-1. Flow of Semantic Information through
the Semantic Tools of an Information System.
1.2 Types of Semantic Tools

There are two general types of communication problems that
arise in information systems and require semantic tools for their
solution. The first concerns the command language; the user should
be told how to make requests which are intelligible to the system,
i.e., syntactically correct. The second deals with the interpretation
of index terms, and this problem will be the focus of our attention
here.

There is a wide spectrum of semantic tools for index terms,
differing in the flexibility of the format in which semantic
information is presented and in the quantity of information presented at one
time. At one end of the spectrum is the documentalist's thesaurus,
in which information is supplied in a rigid format and in small
quantities. Given a term and a relation (e.g., synonymy or "generic to")
the thesaurus will supply a short list of all the terms related to
the given term by the given relation. The basic unit of information
in a thesaurus might be stored in a computer as an ordered triple:
relation i, term j, term k. Greater flexibility is made possible by
introduction of more relations, such as whole-part or process-product.
(This is somewhat similar to the use of roles when indexing documents.)
Relations among three or more terms give even greater variety in
conveying semantic information; at this point, the relation is perhaps
best expressed in terms of a skeletal sentence such as "X is a type
of Y used in the production of Z". Scope notes provided for some of
the terms offer still greater flexibility, while the far end of the
semantic tools spectrum is represented by unrestricted natural
language sentences, such as definitions in a glossary of terms in a
subject area. Here there is no limit on the amount of information
supplied about a term.
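
The ordered-triple storage described above can be sketched as follows. This is only an illustration: the relation names, the sample terms, and the helper names `build_thesaurus` and `related_terms` are assumptions made for the example and do not come from the report.

```python
from collections import defaultdict

# Hypothetical sample entries. Each triple (relation, term_j, term_k)
# records that term_j stands in `relation` to term_k.
TRIPLES = [
    ("synonym", "routine", "program"),
    ("generic_to", "program", "compiler"),
    ("generic_to", "program", "interpretive program"),
    ("whole_part", "computer", "accumulator"),
]

def build_thesaurus(triples):
    """Index the triples by (relation, term) so lookups are direct."""
    index = defaultdict(list)
    for relation, term_j, term_k in triples:
        index[(relation, term_j)].append(term_k)
    return index

def related_terms(index, relation, term):
    """Return the terms related to `term` by `relation` (possibly empty)."""
    return list(index.get((relation, term), []))

thesaurus = build_thesaurus(TRIPLES)
print(related_terms(thesaurus, "generic_to", "program"))
```

Adding a new relation type, such as process-product, requires only new triples with a new relation label; the lookup code is unchanged.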

An important characteristic of an information storage and
retrieval system is the amount of restriction placed on the vocabulary
used for indexing, and this affects the use of semantic tools in the
system. Some systems use a controlled vocabulary or authority list
while others have essentially no restriction on choice of index terms.
For a system with a controlled vocabulary, the thesaurus plays a very
important role (Horner (1963), Henderson, et al. (1966)). It is the
method by which the user transforms his vocabulary into the vocabulary
of the system. If the controlled vocabulary consists of natural
language words, semantic tools can show the special sense in which these
words are used in the system. If the vocabulary consists of codes for
concepts, semantic tools can serve as definitions of these coded
concepts. The thesaurus also has a special function in a system with an
unrestricted indexing vocabulary. Here it supplies the user with
synonyms or near synonyms for the words he originally thought of,
hence allowing him to retrieve a larger fraction of the relevant
documents. Recent research by Cyril Cleverdon (1966) has favored an
unrestricted vocabulary over a controlled vocabulary in one situation.

1.3 Classification Tables and Semantic Expansions

We will now examine in some detail two devices near opposite
ends of the spectrum of semantic tools. One device is closely related
to a thesaurus while the other device involves the use of complete
sentences in natural language.

Since the word "thesaurus" has several connotations, let us
use the phrase "classification table" to represent a two-place
relation within a set of words. A classification table can be stored as
a two-dimensional array, where an entry in cell (i, j) indicates that the
relation holds between word i and word j. A portion of a
classification table is displayed in tabular form in Figure 1-2. A thesaurus
listing synonyms, broader terms, and narrower terms can be stored in
three classification tables, one for each relation (or in two tables,
since broader and narrower are inverse relations).
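
A minimal sketch of a classification table as a two-dimensional array is given below. The three-word vocabulary and the function names are illustrative assumptions; the convention that `broader[i][j]` is true when word i is broader than word j is also an assumption of this sketch. It shows why two tables suffice for three relations: the narrower-term table is simply the transpose of the broader-term table.

```python
WORDS = ["program", "compiler", "ALGOL compiler"]  # illustrative vocabulary
POS = {word: i for i, word in enumerate(WORDS)}

def empty_table(n):
    """An n-by-n table with no relation recorded."""
    return [[False] * n for _ in range(n)]

def set_relation(table, word_i, word_j):
    """Record that the relation holds between word_i and word_j, in that order."""
    table[POS[word_i]][POS[word_j]] = True

def transpose(table):
    """The table of the inverse relation."""
    return [list(column) for column in zip(*table)]

def related(table, word):
    """All words j such that the relation holds between `word` and word j."""
    return [WORDS[j] for j, flag in enumerate(table[POS[word]]) if flag]

broader = empty_table(len(WORDS))
set_relation(broader, "program", "compiler")
set_relation(broader, "compiler", "ALGOL compiler")
narrower = transpose(broader)

print(related(broader, "compiler"))   # words that "compiler" is broader than
print(related(narrower, "compiler"))  # words that "compiler" is narrower than
```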

A classification table can be looked upon as a directed graph,
each node corresponding to a word in the subject area vocabulary, and
a branch between two nodes indicating that the relation holds between
the corresponding words, taken in the order specified by the direction
of the branch. A graph corresponding to a set of classification tables
would have labeled branches, each label indicating which relation its
branch represents. A tree or hierarchy can be represented in a
classification table, but a classification table can encompass more general
structures as well. This greater generality is desirable since most
documentalists have come to the conclusion that all of knowledge
cannot be meaningfully arranged into a single hierarchy nor can all the
words in even a single area of discourse.

The creation of a generic-specific tree for the vocabulary of
the computer programming field illustrates one of the problems with
hierarchies. Any node in a tree may have several branches leaving it
(and leading to nodes representing more specific concepts), but cannot
have more than one branch entering it (from a more generic concept).
The word "compiler" has several specifics (e.g., ALGOL compiler,
FORTRAN compiler) but it also has at least two generic terms, namely
program and processor. A tree would not permit these two viewpoints
about compilers, i.e., considering them as both programs and processors
of programs, but instead would require that a single generic be
specified. Classification tables have no such restriction.
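
The compiler example can be sketched as a labeled directed graph. The branch list, the relation label `generic_to`, and the function names below are assumptions for illustration only. The second function reports the words that have more than one entering branch, which is exactly what a tree could not hold.

```python
from collections import defaultdict

# Each branch is (relation, from_word, to_word). For "generic_to" the
# branch points from the more generic word to the more specific one.
BRANCHES = [
    ("generic_to", "program", "compiler"),
    ("generic_to", "processor", "compiler"),
    ("generic_to", "compiler", "ALGOL compiler"),
    ("generic_to", "compiler", "FORTRAN compiler"),
]

def generics_of(branches, word, relation="generic_to"):
    """Words with a branch of the given relation entering `word`."""
    return sorted(src for rel, src, dst in branches
                  if rel == relation and dst == word)

def violates_tree(branches, relation="generic_to"):
    """Words with more than one entering branch, which a tree forbids."""
    entering = defaultdict(int)
    for rel, _src, dst in branches:
        if rel == relation:
            entering[dst] += 1
    return sorted(word for word, count in entering.items() if count > 1)

print(generics_of(BRANCHES, "compiler"))
print(violates_tree(BRANCHES))
```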

Classification tables should be available both to indexers
while indexing documents and to searchers while looking for documents
to satisfy particular information needs. In one sense, indexing and
searching are reciprocal activities: in the first case the input is
documents and the output is index terms; in the second, the input is
terms and the output is documents. Both activities are highly
dependent on the organization of the vocabulary of index terms, e.g., as
represented by classification tables. It is likely that the user of
classification tables will specify a word and want to see all the
words related to it by one or more relationships. He will note
which of the words seem applicable to his problem, and then specify
another word (perhaps one of the applicable ones just displayed) for
which he wants to see related words. This process will be repeated
for a sequence of words. The information thus displayed will help
the user understand specific terms by relating them to other terms
and will also suggest new terms which might be appropriate in
indexing or requesting documents. It is expected that through the
use of classification tables, a user will clarify his concept of the
content of the document he is indexing or searching for, and will
express this content with less variation (from indexer to searcher
or from searcher to searcher) in the terms of the system vocabulary
than without this tool.

At the other end of the spectrum of semantic tools is the use
of natural language sentences, such as definitions of terms or short
essays on terms or relations between terms. We have chosen to apply
the name "semantic expansion" to a sequence of increasingly detailed
descriptions of system operation or definitions of index terms in
sentence form. The idea of semantic expansion is this: a user
desiring information about a term might first want to see a brief
definition of this term. If the brief definition isn't clear to him
or does not answer his questions about the term, he might wish to
see a more detailed definition. If this is insufficient, perhaps a
paragraph-long essay on the term, with examples, might be called for.
A typical example is shown in Figure 1-3.

Semantic expansions can be implemented in the form of
programed instruction, with the system testing the user after each
level of expansion to determine whether the next level of expansion
is needed. Since semantic expansions require a fair amount of work
to generate, they will first be used only for the most important
terms in the subject area vocabulary (and only for the most important
system operation commands). Other terms will be related to these
terms through classification tables, and each term for which there is
a semantic expansion can have an identifying mark when it occurs in
a classification table.
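
The level-by-level behavior of a semantic expansion might be sketched as below. The stored texts are abbreviated paraphrases, and the `understood` callback, which stands in for the system's test of the user after each level, is a hypothetical device of this sketch rather than anything specified in the report.

```python
# Increasingly detailed descriptions for each term (abbreviated here).
EXPANSIONS = {
    "interpretive program": [
        "A computer program that combines translation and execution.",
        "A computer program which receives source language instructions and, "
        "for each one, identifies the operation and operand(s) and performs it.",
        "A paragraph-long essay with examples would be stored here.",
    ],
}

def expand(term, understood, expansions=EXPANSIONS):
    """Show successive levels until `understood(level, text)` returns True
    or the stored levels run out. Returns the list of texts shown."""
    shown = []
    for level, text in enumerate(expansions.get(term, []), start=1):
        shown.append(text)
        if understood(level, text):
            break
    return shown

# A user who is satisfied once the second level has been displayed.
for text in expand("interpretive program", lambda level, _text: level >= 2):
    print(text)
```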

1.4 Implementation of Semantic Tools

There are a number of practical questions to be answered in
connection with the implementation of semantic aids. One basic
question is how much should be automated or mechanized. A
computer-based system is quite desirable because it allows conversational
interaction between the indexer or searcher and the system.
N.B.: Underlining indicates that further explanation is available from
the system by specifying the underlined word or word phrase.


A. First-Level Response:

"An INTERPRETIVE PROGRAM is a computer program that combines
translation and execution."

B. Second-Level Response:

"An INTERPRETIVE PROGRAM is a computer program which receives
a sequence of instructions in a source language, and for each
instruction identifies the operation and operand(s) and then
performs the action specified by the instruction."

C. Third-Level Response:

"The major characteristic of an INTERPRETIVE PROGRAM is that
a source language instruction is both recognized and performed
each time it is encountered. The interpretive program remains
in control during the whole process. This is to be contrasted
with the action of a compiler, which translates all source
language instructions before any of them are performed, then
turns control over to the compiled (translated) program which
is executed. The step-by-step translation and execution
which occurs under an interpretive program permits one
instruction to modify another. When the modified instruction
is encountered it will be translated into a different action
from before."

D. Fourth-Level Response:

"The following segment of a source language program will be
used to demonstrate the action of an INTERPRETIVE PROGRAM:

    Position    Instruction
    1           FETCH 100
    2           ADD 101
    3           STORE 115

The interpretive program will examine the instruction "FETCH 100",
separate off the operator part ("FETCH"), and by comparison with
a stored set of source language operations, recognize this as
the instruction to bring the contents of some location into
the accumulator. It will then transfer control to a subroutine
which will cause the contents of location 100 to be brought into
a simulated accumulator. (The actual accumulator is used by the
interpretive program for such things as decoding source language
instructions, and hence is not available to the source language
program.) The next instruction "ADD 101" will be recognized
in a similar way, and the appropriate subroutine will cause
the contents of location 101 to be added to the contents of
the simulated accumulator, leaving the result in the simulated
accumulator. The instruction "STORE 115" will then be
recognized and a subroutine will place the contents of the simulated
accumulator into location 115."


Figure 1-3. A Sample Semantic Expansion.
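
The fourth-level response in Figure 1-3 can be followed with a small sketch of an interpretive program. The three operation names and the sample program come from the figure; the dictionary-based memory, the initial contents of locations 100 and 101, and all function names are assumptions made for the illustration.

```python
def op_fetch(state, address):
    """Bring the contents of a location into the simulated accumulator."""
    state["acc"] = state["memory"][address]

def op_add(state, address):
    """Add the contents of a location to the simulated accumulator."""
    state["acc"] += state["memory"][address]

def op_store(state, address):
    """Place the simulated accumulator's contents into a location."""
    state["memory"][address] = state["acc"]

# The stored set of source language operations and their subroutines.
OPERATIONS = {"FETCH": op_fetch, "ADD": op_add, "STORE": op_store}

def interpret(program, memory):
    """Recognize and perform each instruction in turn; return final memory."""
    state = {"acc": 0, "memory": dict(memory)}
    for instruction in program:
        operator, operand = instruction.split()  # separate off the operator
        subroutine = OPERATIONS[operator]        # recognize the operation
        subroutine(state, int(operand))          # perform it immediately
    return state["memory"]

result = interpret(["FETCH 100", "ADD 101", "STORE 115"], {100: 7, 101: 5})
print(result[115])  # 12
```

Because each instruction is looked up and performed only when it is reached, nothing is translated ahead of time, which is the contrast with a compiler that the third-level response draws.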
Assuming a mechanized system, do the increased versatility and
speed of video display of semantic tools justify the difference in
cost between video display and, for example, teletype? Is hard copy
desirable in any case as a permanent record for the user of the
system? In what format should the information in classification
tables be displayed? How many levels of semantic expansion should
be provided? The answers to these questions are, of course, highly
dependent on the expected applications and the environment of the
system to be implemented.

A semantic tool should be capable of continual modification
in response to the evolution of the vocabulary of the field with
which it deals. The stimulus for modification could come from a
subjective review or from computer analysis of user-system
dialogues, index term usage, and statistics derived from the texts
of recently acquired documents. Another alternative is to let the
indexers and searchers modify the semantic tools (cf. Reisner
(1965)). Presumably, some review or control of the results would
be desirable here too.

A pilot system for storage, manipulation, and retrieval
of classification table information has been implemented in the
L6 language by John S. Edwards (1967). In this system new words
can be added to the system vocabulary or old words deleted from
it at any time, a relation between a pair of words can be added to
or deleted from any classification table with ease, and new types
of relations can be defined and incorporated. Experience with
this laboratory tool is expected to lead to greater insight into
classification tables.
There are a number of ways in which semantic tools could
be generated originally. They could be produced by one or more
subject area experts, or derived from the technical literature by
documentalists or information retrieval specialists. However, we
feel that there is considerable promise in the approach to
generation of semantic tools which utilizes statistical properties of
the distribution of word occurrences in the subject area literature.
While this is largely an automatic (computer-based) process, human
selection of text input and editing of output are important. The
following section treats statistical processing of text.




2. STATISTICAL TECHNIQUES IN SEMANTIC TOOL GENERATION

2.1 The Importance of the Statistical Approach

The growing rate at which technical literature is being
produced is putting an increasing strain on current systems for
dissemination or storage and retrieval of technical information. One answer
to this problem is the automation of many of the activities of an
information center or information system. (See the discussion by
Gerard Salton (196*) on automatic information systems.) It is the
contention of many researchers that a great deal of what appear to
be intellectual tasks in the processing of technical documents can be
at least partially mechanized. Such tasks include generation of an
indexing vocabulary and relations among the terms in it, indexing
of incoming documents, and assistance to searchers in formulating
information requests. The main source of information used in the
performance of these tasks by computer is statistical data on the
distribution of occurrences of words in the document texts. The
successes of the statistical approach are a result of the correlation
between statistical measures and syntactic or semantic properties of
importance in information retrieval. The following brief survey will
give an idea of the types of statistical processing which have been
investigated in the past.

2.2 Examples of the Statistical Approach

Various types of statistics have been used in generating a
technical vocabulary. The total frequency of occurrence of a word in
a large text sample can be used to identify words which because of
their very high or very low frequency should not be in the indexing
vocabulary (H. P. Luhn (195*)). The number of different words with
which a given word co-occurs (divided by the frequency of the given
word) was investigated as an indicator of good words to use in
indexing by Robert Curtice and Paul Jones (1967). Their idea, roughly, is
that a word which appears with a great many other words is not likely
to make a good index term, but one which tends to appear in restricted
contexts is likely to make a good one. The vocabulary generation
experiments performed by the author were based primarily on measures
of the distribution of the occurrences of a word among documents in
the collection, i.e., descriptive statistics for the set of
within-document frequencies for each word. The assumption was that a word
concentrated in a few documents is more likely to be a technical term
than a word spread thinly among a large number of documents.
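
The three kinds of statistic mentioned above can be sketched together. This is a rough illustration under stated assumptions: co-occurrence is counted within a whole document (the unit used by Curtice and Jones is not given here), and the mean number of occurrences per containing document is a stand-in for the author's distribution measures, which are described later in the report. The function name and the toy documents are invented for the example.

```python
from collections import Counter

def word_statistics(documents):
    """documents: list of token lists. Returns {word: dict of statistics}."""
    total = Counter()      # total frequency over the collection
    doc_count = Counter()  # number of documents containing the word
    neighbors = {}         # distinct words co-occurring with the word
    for tokens in documents:
        counts = Counter(tokens)
        total.update(counts)
        types = set(counts)
        for word in types:
            doc_count[word] += 1
            neighbors.setdefault(word, set()).update(types - {word})
    return {
        word: {
            "total_frequency": total[word],
            # distinct co-occurring words divided by the word's frequency
            "diversity": len(neighbors[word]) / total[word],
            # a crude concentration measure (assumption of this sketch)
            "occurrences_per_document": total[word] / doc_count[word],
        }
        for word in total
    }

DOCS = [
    ["the", "compiler", "compiler", "compiler", "translates"],
    ["the", "program", "runs"],
    ["the", "user", "waits"],
]
stats = word_statistics(DOCS)
print(stats["compiler"])
print(stats["the"])
```

In this toy collection "compiler" is concentrated in one document and has low diversity, while "the" is spread over every document and co-occurs with everything.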

A number of different statistical measures have been proposed
for use in automatic indexing of documents. One of the most promising
classes of measures consists of functions of both the frequency of a
word in a document and the frequency of the same word in general usage.
H. P. Edmundson and R. E. Wyllys (1961) pointed out that measures of
this type will single out words which are rare in normal use but
frequent in a given document and likely, therefore, to be the names of
the specialized concepts with which the document deals. Fred Damerau
(1965) reported on an experiment comparing several functions of this
kind, in which the function with the best performance was the
probability that a word with a known total frequency in a large reference
collection would have at least as many occurrences as it did in a
particular document if its within-document frequencies had a Poisson
distribution. John O'Connor (1965) described an experiment
investigating the automatic assignment of two index terms ("toxicity" and
"penicillin") to documents which didn't necessarily contain them in
their texts.
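
The Poisson-based function described above can be sketched as a tail probability. The sketch assumes that the Poisson mean for a document is the word's relative frequency in the reference collection multiplied by the document's length; that choice, and the function and parameter names, are assumptions of this example and are not taken from Damerau's paper.

```python
import math

def poisson_tail(k, mean):
    """P(X >= k) for X distributed as Poisson(mean)."""
    if k <= 0:
        return 1.0
    below = sum(math.exp(-mean) * mean ** i / math.factorial(i)
                for i in range(k))
    return max(0.0, 1.0 - below)

def indexing_score(count_in_doc, doc_length, total_frequency, collection_length):
    """Probability of seeing at least `count_in_doc` occurrences by chance.

    Assumed Poisson mean: expected occurrences in a document of this
    length, given the word's rate in the reference collection.
    """
    expected = total_frequency * doc_length / collection_length
    return poisson_tail(count_in_doc, expected)

# A word with 100 occurrences in a million-token reference collection,
# observed once versus five times in a 1000-token document.
print(indexing_score(1, 1000, 100, 1_000_000))
print(indexing_score(5, 1000, 100, 1_000_000))
```

A small probability means the word occurs in the document far more often than its general usage would predict, which marks it as a candidate index term.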

Automatic classification of documents into predetermined
categories is a problem closely related to automatic indexing and has
been treated by M. E. Maron (1961) and J. H. Williams (1965), among
others. Research has also been conducted in automatic generation of
categories to use in classification by means of such techniques as
clumping (R. M. Needham (1962)), factor analysis (H. Borko and M.
Bernick (1963, 1964)), and latent class analysis (F. D. Baker (1963),
W. K. Winters (1965)). These techniques require information about the
joint occurrence of two or more words.

Statistical association measures are functions of the number of
times a pair of terms appear together in a textual unit, e.g., sentence,
or in the set of terms indexing a document. The measures are designed
to be indicators of the tendency for two words to co-occur. If two
terms appear together very often, there is probably a semantic or
empirical relation between them. The matrix containing measures of
statistical association between all pairs of index terms could be used
directly by indexers or searchers, or could be the basis for an
automatic classification procedure which forms groups of terms, as mentioned
above. Alternatively, the measures could be employed in the retrieval
phase to expand requests automatically by addition of terms having a
high statistical association with the original request terms. The
most satisfactory utilization of these statistical relations, however,
would probably be in a man-machine dialogue, in which the computer
would use stored association measures or classification tables in
suggesting new terms for the user's consideration. The computer could
exhibit "adaptive" behavior by utilizing the feedback from the user
(approval or disapproval of previously suggested terms) in making new
suggestions. Experimentation in this area is described later in this
report and in more detail by John S. Edwards (1967).
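
Automatic request expansion from an association matrix can be sketched as follows. The association measure used here (documents indexed by both terms divided by documents indexed by either) is a simple stand-in chosen for the example; it is not one of the measures tested in this report, and the threshold, sample data, and function names are likewise assumptions.

```python
from collections import Counter
from itertools import combinations

def association_matrix(index_sets):
    """index_sets: one set of index terms per document.
    Returns {(term_a, term_b): score} for every co-occurring pair."""
    single = Counter()
    joint = Counter()
    for terms in index_sets:
        single.update(terms)
        joint.update(frozenset(pair) for pair in combinations(sorted(terms), 2))
    assoc = {}
    for pair, both in joint.items():
        a, b = tuple(pair)
        score = both / (single[a] + single[b] - both)
        assoc[(a, b)] = assoc[(b, a)] = score
    return assoc

def expand_request(request, assoc, threshold=0.5):
    """Add every term whose association with a request term meets the threshold."""
    added = {b for (a, b), score in assoc.items()
             if a in request and b not in request and score >= threshold}
    return set(request) | added

INDEXED = [
    {"compiler", "translation"},
    {"compiler", "translation", "ALGOL"},
    {"retrieval", "indexing"},
    {"compiler", "retrieval"},
]
assoc = association_matrix(INDEXED)
print(sorted(expand_request({"compiler"}, assoc)))
```

In a man-machine dialogue the added terms would be offered as suggestions rather than inserted automatically, and the user's approval or disapproval could be used to adjust the threshold or the stored scores.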

Among the earliest workers to use the idea of statistical
association in the context of information retrieval were Melvin E. Maron
and J. L. Kuhns (1960). Another important early paper on statistical
association techniques is that of H. Edmund Stiles (1961). He made the
"association factor" of a pair of terms depend on their joint appearance
in the sets of terms indexing documents, as did Maron and Kuhns. He
pointed out that a pair of synonyms may have a low association factor,
since they will not generally be used to index the same document. But
on the other hand, they are both likely to have high associations with
terms related to the concept they both represent. Stiles called this
a second generation association. A detailed treatment of the
mathematics of associative retrieval came from Vincent Giuliano and Paul Jones
(1963). They proposed a model in which a linear transformation
(involving association measures) of a request vector (equivalent to a list of
terms) results in a response vector (equivalent to a list of documents
with relevance scores for each one). Their formulation uses not only
the basic matrix of term-term association measures but also higher
powers of this matrix, and they showed how the entries in the square of
this matrix (or other even powers of it) could be considered measures
of synonymy. A recent investigation of statistical discrimination of
the synonymy/antonymy relation, employing co-occurrence of triples as
well as pairs, is reported by P. A. W. Lewis, P. B. Baxendale, and J. L.
Bennett (1967).
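
The role of the squared association matrix can be illustrated with a toy case. The three terms, the association values, the document-term matrix, and the particular transformation used in `respond` (the request plus its first `order` association-matrix images, mapped onto documents) are all assumptions of this sketch; it is not the formulation of Giuliano and Jones, only an instance of the general idea that a linear transformation of a request vector yields document scores.

```python
def matvec(matrix, vector):
    """Product of a matrix (list of rows) and a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

def matmul(left, right):
    """Product of two matrices given as lists of rows."""
    columns = list(zip(*right))
    return [[sum(x * y for x, y in zip(row, col)) for col in columns]
            for row in left]

TERMS = ["car", "automobile", "engine"]
# Synonyms "car" and "automobile" never index the same document, so their
# direct association is zero, but both are associated with "engine".
A = [[0.0, 0.0, 0.8],
     [0.0, 0.0, 0.7],
     [0.8, 0.7, 0.0]]
A2 = matmul(A, A)  # second generation associations

DOC_TERM = [[1, 0, 1],   # document 0 is indexed by "car" and "engine"
            [0, 1, 0]]   # document 1 is indexed by "automobile" only

def respond(request, order=2):
    """Score documents from the request plus `order` powers of A applied to it."""
    expanded = list(request)
    image = list(request)
    for _ in range(order):
        image = matvec(A, image)
        expanded = [e + p for e, p in zip(expanded, image)]
    return matvec(DOC_TERM, expanded)

print(A2[0][1])                    # positive although A[0][1] is zero
print(respond([1, 0, 0], order=1)) # document 1 is not reached
print(respond([1, 0, 0], order=2)) # document 1 is reached through "engine"
```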

2.3 Statistical Properties of Language Used in the Current Experiment

Figure 2-1 exhibits some word groups of interest for
information retrieval. On the left side is a picture showing how we define a
common word. Consider that every word (i.e., word type) which appears
in some large selection of literature not concentrated in any
particular subject area is represented by a point in the rectangular box. The
distance of this point from the bottom of the box is a function of the
relative frequency of the word, relative frequency being the absolute
frequency of a word type in a text (i.e., the number of its occurrences
or tokens) divided by the length of that text (i.e., the total number
of tokens of all words in that text). More precisely, the height of
a point is proportional to the percentile on a frequency basis of the
word represented by the point. We define a common word as one which
has a relative frequency greater than some value in the language as
a whole, as approximated by a sample of general literature. On the
right is an illustration of the important categories of words in
literature dealing with a single specialty area. The horizontal line near
the middle of the rectangle corresponds to the percentile which has
above it as many word types as the percentile defining common words
in the general literature. The percentile is lower on the right
because there are fewer types in the sample of specialized literature.
Note that some of the words which were common words in the general
literature are now below the dividing line in the specialized
literature. Their place above the dividing line is taken primarily by
specialty terms which had a lower relative frequency in the general
Note: Core Terms and Particular
Terms Together Comprise the
Specialty Vocabulary.


Figure 2-1. Word Groups in Two Types of Text Collections.


literature and which we have labeled core terms. These are the
high-frequency words in the technical vocabulary of the subject area.
The name of the field itself would be a core term. The region labeled
"Particular" contains the mid-frequency and low-frequency technical
vocabulary. It is likely that many of the particular words have a
greater relative frequency or frequency percentile in the specialized
literature than in the general literature, and they are expected to
make good index terms. It is important to note that the partition of
specialty words into core and particular words is quite dependent on
the scope of the set of subject area documents under consideration.
A core word for a very specialized field will be a particular word in
a broader field where it shares the focus of attention with important
terms from the other specialized subfields which joined to form the
broader field. Hence, if the scope of a document collection is likely
to change, it is desirable to use core words as index terms, in
addition to particular words. The amalgamation of subfields can be
visualized as a "diluting" process with respect to the relative
frequencies of the core terms for the subfields. But a "concentrating"
process is going on at the same time with respect to the words which
will become the core words for the broader field, since they will
appear with a moderate frequency in each of the subfields. The
relative frequency of a core word for the broader field will be a
suitably weighted average of the relative frequencies in the
subfields, and hence no larger than the largest of them. However, the
percentile of the new core word will be greater in the broader field
than in any of the narrower subfields merely because of the increase
in number of word types of lower relative frequency as subfields are
merged. If this merging process is repeated a number of times, the
scope of the broader field will become greater and greater, and the
number of new word types added will begin to decline. Then the
percentile increase for potential broad-scope core terms due to the
addition of new types will be progressively less and less; the number
of core terms demoted to the particular category by the dilution
effect will exceed the number of potential broad-scope core terms
actually promoted into the core category, and the size of the core
term set will decrease. In the limiting case where the subject area
is "all knowledge" there will be no core terms, by definition.

This discussion points out, among other things, that the correlation
between frequency percentile and degree of generality of a specialty
word will not be complete, since core words for broader fields are
mixed with other particular words in the narrower fields. Another
point to be made from the pictorial representation of word groups is
that there is no reason to believe that a single simple statistical
measure will serve to separate the specialty from the non-specialty
(including common) words at all frequencies. Because of the different
nature of the "strata" at different frequencies, it might even be the
case that different types of measures would be needed to perform the
separation for different frequency regions.

The partition of words into common and non-common on the basis of
their frequency in the general literature is a first approximation to
separation of function words from content words. (Function words are
words whose role is primarily syntactic, e.g., conjunctions, articles,
and prepositions.) The common word set will contain the high-frequency
function words plus a few content words which will generally not be
specialty words, for example, "man", "house", "make", and "give".
If it is desired to separate specialty words from non-specialty, the
common words can pretty safely be excluded from consideration as
specialty words. One way to separate the specialty words from the
non-specialty words which are not common words is through measures of
the distribution of words among documents, working on the assumption
that each specialty word will have a tendency to be concentrated in
the documents for which it is relevant and relatively rare in the
rest. To be more precise about this, let us consider the frequency
distributions graphed in Figure 2-2. It has been our hypothesis that
the Poisson distribution can very roughly describe the distribution of
within-document frequencies for a given word. (Damerau's (1965)
article was the stimulus for this idea.) The probabilities associated
with the Poisson distribution are actually appropriate for the case
where the occurrences of a word are distributed randomly throughout
the text. This suggests that we look for deviations from the Poisson
distribution as a clue that a word of importance in the subject area
has been deliberately clustered in the documents to which it is
relevant.

The two graphs at the top of Figure 2-2 are Poisson distributions for
two function words of different frequencies. The graph at the bottom
is obtained by deliberate clustering in a few documents of the
occurrences of a mid-frequency specialty word. Because of this
clustering, it appears in fewer documents than a function word of the
same frequency. Thus the bar above 0 is higher for the specialty word
than for the function word, indicating that it has zero frequency in
more documents. And in the documents where the specialty word does
appear, it tends to have a higher frequency; hence the bars above 4,
5, and 6 are higher than for the corresponding function word.
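
The contrast between a randomly scattered word and a clustered word can be checked numerically. The sketch below is an illustration only: the collection of 100 documents, the 100 occurrences, and the clustering pattern (5 occurrences in each of 20 documents) are invented numbers, and the function names are hypothetical. It compares the number of documents expected at each within-document frequency under a Poisson model with the counts observed for the clustered word.

```python
import math


def poisson_pmf(k, m):
    """Poisson probability of exactly k occurrences when the mean is m."""
    return math.exp(-m) * m ** k / math.factorial(k)


def expected_document_counts(total_occurrences, n_docs, max_k):
    """Expected number of documents with frequency 0..max_k if the occurrences
    were scattered at random over n_docs documents of equal length."""
    m = total_occurrences / n_docs
    return [n_docs * poisson_pmf(k, m) for k in range(max_k + 1)]


def observed_document_counts(within_doc_freqs, max_k):
    """Number of documents observed at each frequency 0..max_k
    (frequencies above max_k are counted in the last cell)."""
    counts = [0] * (max_k + 1)
    for f in within_doc_freqs:
        counts[min(f, max_k)] += 1
    return counts


# Invented example: 100 occurrences in 100 documents.
# The "specialty" word occurs 5 times in each of 20 documents and nowhere else.
clustered = [5] * 20 + [0] * 80
expected = expected_document_counts(sum(clustered), len(clustered), max_k=6)
observed = observed_document_counts(clustered, max_k=6)
```

Here the observed bar above 0 (80 documents) is far higher than the Poisson expectation (about 37), and the bar above 5 is also higher than expected, which is the pattern described for the lower graph.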
Figure 2-2. Typical Within-Document Frequency Distributions.


We tested a number of different standard distribution measures, all
based on the set of within-document frequencies, for example, the
variance and coefficient of skewness. In addition, we tested two new
measures which were explicitly related to the Poisson distribution.
These measures are described in Section 3.

Sally Dennis (1965, 1967) has done a similar evaluation of a number
of single word distribution measures, and several of the measures we
tested were suggested by her work. Her results are compared with ours
at appropriate points later in this report.

It is interesting to note the relationship between the approach used
by Sally Dennis and us in generating an indexing vocabulary and the
approach suggested by Edmundson and Wyllys and used by Damerau in
automatic indexing of individual documents. In the automatic indexing
situation, a word is assigned to a document if its relative frequency
in that document is significantly higher than its relative frequency
in the document collection as a whole or the language as a whole. If a
given word is assigned by this process to several documents because of
its above-average relative frequency in them, then there must also be
a number of documents in which the relative frequency of the word is
lower than the average. Thus, if the average is small, there will be
a large number of documents in which the word doesn't appear at all,
and we have the situation depicted in the lower graph of Figure 2-2.
A word with such a distribution would presumably be included in the
indexing vocabulary by the procedures used by Dennis or us.

We were also interested in exploring the use of statistical
association measures in vocabulary generation. We had in mind a
process in which a small "kernel" set of subject area terms would be
expanded to a more nearly definitive vocabulary for the field by
successive additions of terms closely related to terms already in the
set. Before describing this process in detail, it will be useful to
define and focus on a concept which is applied here and in many other
situations.

The concept is word inference, which we define to be a mapping from
one set of terms to another set of terms (called the inferred set).
The original set may have structure, such as weights assigned to its
members or logical connectives joining them. The mapping may involve
the measures of association discussed earlier, or may utilize word
classification tables from any source. H. E. Stiles (1961) used word
inference in the following way: given a set V of words (index terms),
a new word x is in the inferred set V' if its measure of association
with a certain fraction of the words in V is greater than a certain
value. Hence the use of association measures or classification tables
to modify retrieval requests can be fit into the framework of word
inference; the mapping in this case is from the terms in the current
request to suggested additions. Also, most experiments with automatic
classification or "clumping" of the words in a vocabulary utilize word
inference. The process of word inference is quite a common activity
in automatic information systems.

Word inference is the basic operation in the expansion process which
we investigated. Given a set of terms called the kernel and a large
set of non-kernel terms, the word inference mapping is computed in
the following way: first, the measure of association between a
non-kernel word and each kernel word is obtained. The sum of these
measures and the sum of the squares of the measures are then computed.
If the sum exceeds one threshold or the sum of squares exceeds another,
then the non-kernel word is in the inferred set. After this process
is repeated for each non-kernel word, the resulting inferred set is
added to the kernel set, and the whole process can begin again. For
each non-kernel word which does not pass either of the threshold tests
during an iteration, the sum and sum of squares are saved so that in
the next iteration only the associations with the words just added to
the kernel set need be computed. The reason for having two threshold
tests is the following: if a word has associations with a large number
of kernel words, it will be likely to pass the sum test, whereas if it
has unusually high associations with a small number of kernel words,
it may pass the sum of squares test but not the sum test. The relation
between the two thresholds can be set so as to favor one situation or
the other.
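
The expansion procedure just described can be sketched as follows. This is a minimal illustration, not the program used in the experiment: the function name `expand_kernel`, the `assoc` callback, the iteration limit, the threshold values, and the toy association table are all assumptions introduced here. The running sum and sum of squares are kept for each non-kernel word, so each iteration only looks up associations with the words added in the previous iteration.

```python
def expand_kernel(kernel, candidates, assoc, sum_threshold, square_threshold,
                  max_iterations=10):
    """Grow a kernel set by repeated word inference.

    assoc(w, k) returns the measure of association between words w and k.
    A candidate joins the kernel when its running sum of associations exceeds
    sum_threshold or its running sum of squares exceeds square_threshold.
    """
    kernel = set(kernel)
    remaining = set(candidates) - kernel
    sums = {w: 0.0 for w in remaining}
    squares = {w: 0.0 for w in remaining}
    newly_added = set(kernel)          # first pass uses the whole kernel
    for _ in range(max_iterations):
        if not newly_added or not remaining:
            break
        inferred = set()
        for w in remaining:
            # Only associations with the words just added are computed;
            # earlier contributions are already in the saved totals.
            for k in newly_added:
                a = assoc(w, k)
                sums[w] += a
                squares[w] += a * a
            if sums[w] > sum_threshold or squares[w] > square_threshold:
                inferred.add(w)
        kernel |= inferred
        remaining -= inferred
        newly_added = inferred
    return kernel


# Hypothetical symmetric association table for illustration.
table = {
    ("compiler", "program"): 0.9,
    ("compiler", "language"): 0.3,
    ("syntax", "language"): 0.4,
    ("syntax", "compiler"): 0.7,
    ("river", "program"): 0.05,
}


def assoc(a, b):
    return table.get((a, b), table.get((b, a), 0.0))


expanded = expand_kernel({"program", "language"},
                         {"compiler", "syntax", "river"},
                         assoc, sum_threshold=1.0, square_threshold=0.5)
```

In this toy run "compiler" passes both tests in the first iteration, "syntax" passes the sum test only after "compiler" has joined the kernel, and "river" is never added.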

We envisioned that this expansion process could be used in
conjunction with the single word statistics approach to vocabulary
generation. The original kernel set might be derived by using single
word distribution statistics, for example. After several iterations
of the expansion process, we would have a larger interrelated set of
terms which could form an indexing vocabulary.

This expansion process is closely related to the process of forming
clumps (Needham (1962)) and the use of the B-Coefficient explored by
Blinn A. Salisbury, Jr., and H. Edmund Stiles (1967). Both of these
methods require a matrix of association or correlation measures for
all the terms of interest, and when considering adding a term to a
group use a function of both its associations with terms in the group
and its association with all terms not in the group. The expansion
process described above requires less total computation than the
clumping or B-Coefficient techniques, since it uses only measures of
association between the candidate term and the terms in the group. If
there are M words in the final (expanded) kernel set and N words
altogether, the expansion process will have required the computation
of MN associations, rather than N^2. On the other hand, the
associations between a given non-kernel word and all the other
non-kernel words may contribute to a better decision on whether to add
the word to the kernel or not. An interesting topic for future
investigation is the question of whether or not the sum or sum of
squares of associations with other non-kernel words can be estimated
from data like frequency, single word distribution statistics, and sum
and sum of squares of associations with kernel words.

Another characteristic of terms which we would like to be able to
determine statistically is the degree of generality or specificity
they possess. The division of specialty words into core words and
particular words on the basis of frequency is a first approximation to
this, since we expect that more general terms will often be more
frequent in a given collection. However, the correlation between
frequency and degree of generality is not complete, since, as we
pointed out earlier, a general term for an area broader than the area
covered by the given collection will often appear with the frequency
of a particular term in the more restricted field. Hence our
experiment included an evaluation of single word distribution measures
as discriminators between general and specific terms.

This section has described some types of statistical processing of
text which can be performed by computer. Our goal is to find
statistical techniques which will help us to discover the relations
among words and interpretations of words used in an area of
specialization. The following section will describe our investigation
of some specific statistical techniques.


3.  AN EXPERIMENT IN THE USE OF STATISTICAL TECHNIQUES

3.1  Introduction

An experiment was conducted to test some of the hypotheses about
language usage mentioned earlier and to demonstrate the feasibility
of using statistical techniques in the generation of semantic tools.
The two topics investigated were (1) semi-automatic generation of a
subject area vocabulary, and (2) semi-automatic generation of
classification tables for that area. In particular, the first part of
the experiment was a comparative evaluation of several measures of the
distribution of words among documents as discriminators between
specialty (informing) words and non-specialty (non-informing) words
for the subject area. The second main task of the experiment was the
evaluation of two applications of statistical association measures,
one involving their contribution to vocabulary generation and the
other involving the use of classification tables based on statistical
association. It was our hope that the statistical data obtained from a
set of sample documents could be used to obtain an indexing vocabulary
and classification tables useful not only in processing the sample
documents, but also in processing new documents from the same subject
area. The rest of this section will enumerate the measures which were
tested and describe the procedures used in their computation.

3.2  Single  Word  Statistics 

The single word distribution measures which were computed included
the following:

AFOC(j) = Absolute Frequency of Occurrence of word j in the
    Collection, i.e., number of tokens (occurrences) corresponding
    to the j-th word type.

AFOD(j,d) = Absolute Frequency of Occurrence of word j in
    Document d.

RFOD(j,d) = Relative Frequency of Occurrence of word j in
    Document d,

    $$\mathrm{RFOD}(j,d) = \frac{\mathrm{AFOD}(j,d)}{L(d)},$$

    where L(d) is the length of document d, i.e., total number of
    tokens in d.

LFOD(j,d) = Log-normalized Frequency of Occurrence of word j in
    Document d,

    $$\mathrm{LFOD}(j,d) = \frac{\mathrm{AFOD}(j,d)}{\log_{10}(L(d))}.$$

MAFOD(j) = Mean of the values of AFOD(j,d) for all documents d in
    the collection,

    $$\mathrm{MAFOD}(j) = \frac{1}{N}\,\mathrm{AFOC}(j),$$

    where N is the total number of documents in the collection.

VAFOD(j) = Variance of the values of AFOD(j,d) for all d in the
    collection,

    $$\mathrm{VAFOD}(j) = \frac{1}{N}\sum_{d=1}^{N}\mathrm{AFOD}(j,d)^2 - \mathrm{MAFOD}(j)^2.$$

    (The function actually used was the unbiased estimator of the
    population variance:

    $$\frac{1}{N-1}\sum_{d=1}^{N}\mathrm{AFOD}(j,d)^2 - \frac{N}{N-1}\,\mathrm{MAFOD}(j)^2.)$$

TAFOD(j) = Third moment about the mean of the values of AFOD(j,d)
    for all d in the collection,

    $$\mathrm{TAFOD}(j) = \frac{1}{N}\sum_{d=1}^{N}\mathrm{AFOD}(j,d)^3 - 3\,T(j)\,\mathrm{MAFOD}(j) + 2\,\mathrm{MAFOD}(j)^3,$$

    where T(j) is the second moment about zero, i.e.,

    $$T(j) = \frac{1}{N}\sum_{d=1}^{N}\mathrm{AFOD}(j,d)^2.$$

    (The unbiased form is

    $$\frac{N}{(N-1)(N-2)}\sum_{d=1}^{N}\mathrm{AFOD}(j,d)^3 - 3\,T'(j)\,\mathrm{MAFOD}(j) + \frac{2N^2}{(N-1)(N-2)}\,\mathrm{MAFOD}(j)^3,$$

    where

    $$T'(j) = \frac{N}{(N-1)(N-2)}\sum_{d=1}^{N}\mathrm{AFOD}(j,d)^2.)$$

GAFOD(j) = Gamma or coefficient of skewness of the values of
    AFOD(j,d) for all d in the collection,

    $$\mathrm{GAFOD}(j) = \frac{\mathrm{TAFOD}(j)}{\mathrm{VAFOD}(j)^{3/2}}.$$

MRFOD(j) = Mean of the values of RFOD(j,d),

    $$\mathrm{MRFOD}(j) = \frac{1}{N}\sum_{d=1}^{N}\mathrm{RFOD}(j,d).$$

VRFOD(j) = Variance of the values of RFOD(j,d).

TRFOD(j) = Third moment of the values of RFOD(j,d).

GRFOD(j) = Gamma or coefficient of skewness of the values of
    RFOD(j,d).

MLFOD(j) = Mean of the values of LFOD(j,d).

VLFOD(j) = Variance of the values of LFOD(j,d).

TLFOD(j) = Third moment of the values of LFOD(j,d).

GLFOD(j) = Gamma or coefficient of skewness of the values of
    LFOD(j,d).

AFSC(j) = Absolute Frequency on Sentence basis of word j in the
    Collection, i.e., number of sentences in which the j-th word
    type appeared.

AFDC(j) = Absolute Frequency on Document basis of word j in the
    Collection, i.e., number of documents in which the j-th word
    type appeared.
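
As an illustration of how the measures above fit together, the following sketch computes the mean, variance, third moment, and skewness of the AFOD, RFOD, and LFOD values for a single word. It is a hedged example rather than the original program: the function names and the toy data are hypothetical, and it uses the simple divide-by-N forms, not the unbiased estimators that the report says were actually used. Document lengths must exceed 1 so that the base-10 logarithm is positive.

```python
import math


def moments(values):
    """Return (mean, variance, third central moment, skewness) of values,
    using the population (divide-by-N) forms."""
    n = len(values)
    mean = sum(values) / n
    second = sum(v * v for v in values) / n        # T(j): second moment about zero
    third_raw = sum(v ** 3 for v in values) / n
    variance = second - mean ** 2
    third = third_raw - 3 * second * mean + 2 * mean ** 3
    skewness = third / variance ** 1.5 if variance > 0 else 0.0
    return mean, variance, third, skewness


def single_word_statistics(afod, lengths):
    """afod[d] is AFOD(j,d) and lengths[d] is L(d) for one word j."""
    rfod = [f / length for f, length in zip(afod, lengths)]
    lfod = [f / math.log10(length) for f, length in zip(afod, lengths)]
    return {
        "AFOC": sum(afod),                          # tokens in the collection
        "AFDC": sum(1 for f in afod if f > 0),      # documents containing the word
        "AFOD": moments(afod),                      # MAFOD, VAFOD, TAFOD, GAFOD
        "RFOD": moments(rfod),                      # MRFOD, VRFOD, TRFOD, GRFOD
        "LFOD": moments(lfod),                      # MLFOD, VLFOD, TLFOD, GLFOD
    }


# Toy data: a word occurring 4 times in one of four 100-token documents.
stats = single_word_statistics([0, 0, 0, 4], [100, 100, 100, 100])
```

For the toy word the AFOD values have mean 1, variance 3, and third moment 6, so the skewness is 6 / 3^(3/2), a strongly right-skewed distribution of the kind expected for a clustered word.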
Many of the above measures were proposed and tested by Sally Dennis
(1965). An article by Fred Damerau (1965) reporting that a measure
based on the Poisson distribution was quite successful in an automatic
indexing experiment stimulated us to define and test the following two
new measures:

VPLFOD(j) = Variance with Poisson normalization of the values of
    LFOD(j,d),

    $$\mathrm{VPLFOD}(j) = \frac{\mathrm{VLFOD}(j)}{K \cdot E_p(\mathrm{VLFOD}(j))} = \frac{\mathrm{VLFOD}(j)}{\mathrm{AFOC}(j)},$$

    where $E_p(\mathrm{VLFOD}(j))$ is the expectation of VLFOD(j) if
    AFOD(j,d) had a Poisson distribution and all documents were of
    equal length; we will shortly derive the fact that this is
    proportional to AFOC(j).

S(j) = Size of a document collection constructed by taking all
    documents for which AFOD(j,d) is not zero and adding to them
    enough documents for which AFOD(j,d) is zero to make the
    resulting set of AFOD's most nearly fit the Poisson distribution.

The measure VPLFOD(·) is 1/K times the ratio of an actual variance to
the expectation of that variance, but was computed as the ratio of an
actual variance to an actual frequency. A proof of the proportionality
of the expected variance $E_p(\mathrm{VLFOD}(\cdot))$ and the actual
frequency AFOC(·) follows:

Suppose AFOD(j,d) has a Poisson distribution. Then

$$P(\mathrm{AFOD}(j,d) = k) = e^{-m}\,\frac{m^k}{k!}, \quad \text{for } k = 0, 1, 2, \ldots,$$

using P(v) to represent the probability of event v. The parameter m in
the above formula is both the mean and the variance of the Poisson
distribution. Let RFOC(j) stand for the relative frequency of
occurrence of word j in the collection, i.e., AFOC(j)/L(C), where L(C)
is the length of the text in the entire collection (number of word
tokens in the collection). Then it must be the case that

$$m = m(j,d) = \mathrm{RFOC}(j)\,L(d).$$

If there are N documents of equal length in the collection, then
L(d) = L and L(C) = NL. Hence

$$m = \frac{\mathrm{AFOC}(j)}{NL}\,L = \frac{\mathrm{AFOC}(j)}{N}.$$

If the variance of the distribution of absolute frequencies is m, the
variance of log-normalized frequencies will be $m / (\log_{10}(L))^2$,
since $\mathrm{LFOD}(j,d) = \mathrm{AFOD}(j,d) / \log_{10}(L)$. Thus the
expected variance will be

$$E_p(\mathrm{VLFOD}(j)) = \frac{m}{(\log_{10}(L))^2} = \frac{\mathrm{AFOC}(j)}{N\,(\log_{10}(L))^2}, \quad \text{or}$$

$$E_p(\mathrm{VLFOD}(j)) = \frac{1}{K}\,\mathrm{AFOC}(j), \quad \text{with } K = N\,(\log_{10}(L))^2.$$
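
A small numerical check may make the normalization concrete. The sketch below is illustrative only: the function names and the eight-document toy collection are hypothetical, all documents are assumed to have the same length (as in the derivation), and the divide-by-N variance is used rather than the unbiased estimator. It computes VLFOD, the expected variance under the Poisson assumption, and VPLFOD for an evenly spread word and a clustered word with the same collection frequency.

```python
import math


def vlfod(afod, lengths):
    """Population variance of the log-normalized frequencies LFOD(j,d)."""
    lfod = [f / math.log10(length) for f, length in zip(afod, lengths)]
    n = len(lfod)
    mean = sum(lfod) / n
    return sum(x * x for x in lfod) / n - mean * mean


def expected_vlfod_poisson(afoc, n_docs, length):
    """E_p(VLFOD) = AFOC / (N * (log10 L)^2) for N documents of equal length L."""
    return afoc / (n_docs * math.log10(length) ** 2)


def vplfod(afod, lengths):
    """VPLFOD(j) = VLFOD(j) / AFOC(j)."""
    return vlfod(afod, lengths) / sum(afod)


lengths = [100] * 8
spread = [1, 1, 1, 1, 1, 1, 1, 1]        # 8 occurrences, one per document
clustered = [0, 0, 0, 0, 0, 0, 4, 4]     # 8 occurrences in two documents

expected = expected_vlfod_poisson(afoc=8, n_docs=8, length=100)
ratio_clustered = vlfod(clustered, lengths) / expected
```

For the clustered word the actual variance is three times the Poisson expectation, while the evenly spread word has zero variance. Since the expectation is proportional to AFOC(j), dividing by AFOC(j) ranks the two words the same way.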

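The measure VPLFOD(·) is quite similar to the measure with which
Sally Dennis (1967) has had the greatest success in discriminating
between informing and non-informing words. This measure, called
NOCC/EK in her terminology, is the relative frequency analogue of
VPLFOD(·), which is based on log-normalized frequencies. The two
measures can be proved proportional under the assumption that all N
documents have the same length L. NOCC is AFOC(·), and

$$\mathrm{EK} = \frac{\mathrm{MRFOD}(\cdot)^2}{\mathrm{VRFOD}(\cdot)}.$$

By the equal length assumption,

$$\mathrm{RFOD}(j,d) = \frac{\mathrm{AFOD}(j,d)}{L}, \quad \text{and}$$

$$\mathrm{MRFOD}(j) = \frac{1}{L}\,\mathrm{MAFOD}(j) = \frac{1}{NL}\,\mathrm{AFOC}(j), \quad \text{and}$$

$$\mathrm{VRFOD}(j) = \frac{1}{L^2}\,\mathrm{VAFOD}(j) = \frac{(\log_{10}(L))^2}{L^2}\,\mathrm{VLFOD}(j).$$

Hence,

$$\frac{\mathrm{NOCC}}{\mathrm{EK}} = \frac{\mathrm{VRFOD}(j)\cdot\mathrm{AFOC}(j)}{\mathrm{MRFOD}(j)^2}
= \frac{\dfrac{(\log_{10}(L))^2}{L^2}\,\mathrm{VLFOD}(j)\,\mathrm{AFOC}(j)}{\dfrac{1}{L^2 N^2}\,\mathrm{AFOC}(j)^2}
= \frac{(\log_{10}(L))^2\,\mathrm{VLFOD}(j)}{\dfrac{1}{N^2}\,\mathrm{AFOC}(j)}
= N^2\,(\log_{10}(L))^2\,\mathrm{VPLFOD}(j).$$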

The measure S(j) is an estimate of the size (number of documents) of
the hypothesized document collection for which the observed non-zero
values of AFOD(j,d) would most nearly fit the Poisson distribution.
(In order to treat all the values of AFOD(j,d) as samples from a
single distribution, it is necessary to make the approximation that
all the documents are the same length.) S(j) could be computed in at
least two ways, both of which begin by segregating the non-zero values
of absolute frequency (AFOD(j,d)) from the zero values, and making an
estimate $\hat{m}_j$ of the Poisson parameter m from the non-zero
values. One method then calculates the probability that AFOD(j,d)
would be zero, using the Poisson formula
$P(\mathrm{AFOD}(j,d) = 0) = e^{-\hat{m}_j}$. We can visualize that
this probability is used to obtain the number of zero-AFOD documents
which together with the non-zero values would most closely fit the
Poisson curve. Actually, the formula for this estimator of the new
collection size is
$S'(j) = \mathrm{AFDC}(j) + S'(j)\,e^{-\hat{m}_j}$. Hence,

$$S'(j) = \frac{\mathrm{AFDC}(j)}{1 - e^{-\hat{m}_j}}.$$

The other method is computationally simpler, however. Recall that for
the case of a Poisson distribution of within-document frequencies for
T documents of equal length,

$$m = \frac{\mathrm{AFOC}(j)\,L}{T\,L} = \frac{\mathrm{AFOC}(j)}{T}.$$

In the present case we are assuming that all the occurrences of term j
are concentrated in a subcollection consisting of S(j) documents and
that in this subcollection the within-document frequencies are
approximately Poisson distributed. Hence this estimate of the new
collection size is

$$S(j) = \frac{\mathrm{AFOC}(j)}{\hat{m}_j}.$$

This is the formula which was actually used, though some sample
computations showed that the two formulas gave very close values on
real data (where the documents were not of constant length).
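
The two estimators of the hypothesized collection size can be compared directly. The following is a minimal sketch with invented numbers (60 occurrences spread over 26 documents and an assumed estimate of 2.0 for the Poisson parameter); the function names are hypothetical.

```python
import math


def size_from_zero_probability(afdc, m_hat):
    """S'(j) = AFDC(j) / (1 - exp(-m_hat)): scale up the documents that contain
    the word by the Poisson probability of a non-zero frequency."""
    return afdc / (1.0 - math.exp(-m_hat))


def size_from_mean(afoc, m_hat):
    """S(j) = AFOC(j) / m_hat: the number of documents over which the
    occurrences would have to be spread to give mean m_hat."""
    return afoc / m_hat


# Invented values: 60 tokens in 26 documents, estimated Poisson parameter 2.0.
s_prime = size_from_zero_probability(afdc=26, m_hat=2.0)
s = size_from_mean(afoc=60, m_hat=2.0)
```

With these numbers the two estimates are about 30.07 and 30 documents, which is the kind of close agreement the text reports for real data.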

Most of the common estimators for the Poisson parameter m, such as
the sample mean, depend on the number of documents for which
AFOD(j,d) = 0, as well as the non-zero values of AFOD(j,d). Our
estimate $\hat{m}_j$ of the parameter m was obtained from the non-zero
values alone in the following way:

For a Poisson distribution with parameter m,

$$P(k) = e^{-m}\,\frac{m^k}{k!} \quad \text{and} \quad P(k-1) = e^{-m}\,\frac{m^{k-1}}{(k-1)!}.$$

Hence,

$$m = \frac{k\,P(k)}{P(k-1)}.$$

Let $n_j(k)$ be the number of documents for which AFOD(j,d) = k in
some collection, and let T be the number of documents in that
collection. Then if AFOD(j,d) has a Poisson distribution,
P(AFOD(j,d) = k) can be approximated by $n_j(k)/T$, and m can be
estimated by using ratios of the form

$$\frac{k\,n_j(k)}{n_j(k-1)},$$

i.e., the sample analogue of $k\,P(k)/P(k-1)$.

We were looking for an estimate of m which doesn't depend on
$n_j(0)$, so we might have taken a weighted average of these ratios
for k = 2, 3, 4, ... up to the largest value of k for which
$n_j(k) \neq 0$. The ratio, however, is undefined when
$n_j(k-1) = 0$. One solution to this problem is to smooth the values
of $n_j(k)$ such that they are monotone decreasing to the right of the
peak as k increases. The algorithm we used to accomplish this is
illustrated in Figure 3-1. The values resulting from this algorithm
were then used in an estimate $\hat{m}_j$ of m obtained by the formula

$$\hat{m}_j = \sum_{k=2}^{k_j} c_j(k)\,\frac{k\,n_j'(k)}{n_j'(k-1)},$$

where $n_j'(k)$ = number of documents for which AFOD(j,d) = k, after
    smoothing so that $n_j'(k) \neq 0$ for $k \le k_j$,

$$c_j(k) = \frac{n_j'(k-1) + n_j'(k)}{2\,(N - n_j(0)) - n_j(1) - n_j'(k_j)},$$

$k_j$ = largest k such that $n_j'(k) > 0$, and
N = number of documents in original collection.

Note that

$$\sum_{k=2}^{k_j} c_j(k) = 1.$$
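
The weighted-ratio estimate can be sketched as below, under explicit assumptions. The input is taken to be the already smoothed counts n'(1), ..., n'(k_j), all positive; the smoothing step of Figure 3-1 is not reproduced here. The weights are read as c(k) = (n'(k-1) + n'(k)) divided by their total over k = 2..k_j, which is computed directly from the counts so that the weights always sum to one; this total equals 2 * sum(n') - n'(1) - n'(k_j). The function name and the example counts are hypothetical.

```python
def estimate_poisson_parameter(n_prime):
    """Weighted average of the ratios k * n'(k) / n'(k-1) for k = 2..k_j.

    n_prime[k] is the smoothed number of documents with within-document
    frequency k, for k = 1..k_j; n_prime[0] is ignored, so the estimate
    does not depend on the number of zero-frequency documents.
    """
    k_max = len(n_prime) - 1
    if k_max < 2:
        raise ValueError("need counts for at least k = 1 and k = 2")
    if any(n_prime[k] <= 0 for k in range(1, k_max + 1)):
        raise ValueError("smoothed counts must be positive for 1 <= k <= k_j")
    # Sum over k = 2..k_j of (n'(k-1) + n'(k)); this makes the weights sum to 1.
    denominator = 2 * sum(n_prime[1:]) - n_prime[1] - n_prime[k_max]
    m_hat = 0.0
    for k in range(2, k_max + 1):
        weight = (n_prime[k - 1] + n_prime[k]) / denominator
        m_hat += weight * k * n_prime[k] / n_prime[k - 1]
    return m_hat


# Counts proportional to a Poisson distribution with m = 2 for k = 1..4
# (24, 24, 16, 8); the entry at index 0 is a placeholder and is not used.
m_hat = estimate_poisson_parameter([0, 24, 24, 16, 8])
```

Because every ratio in this example equals 2, the weighted average recovers m = 2 whatever the weights are; on real data the weights give more influence to the ratios based on larger document counts.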

3.3  Word Pair Statistics

The second main type of statistical processing performed in the
current experiment was statistical word association, a two step
process for determining the extent to which words co-occur in a given
collection of text. The first step in statistical word association is
computation of co-occurrence statistics, and the second is computation
of measures of association. The text input for the first step consists
Note: the subscript j is dropped throughout this flowchart.

    Set Peak to the value of k which maximizes n(k) for k >= 2.
    Set Klim to the largest k such that n(k) > 0.
    Set n'(k) to n(k) for all k <= Klim.

    I = Peak

    Is n'(I+1) > n'(I)?
        If yes:
            Diff    = n'(I+1) - n'(I)
            n'(I+1) = n'(I+1) - .6*Diff
            n'(I)   = n'(I)   + .4*Diff
            n'(I+2) = n'(I+2) + .2*Diff

    I = I + 1

Figure 3-1. Algorithm for Smoothing the Values n_j(k) of the
Frequency Function for the Within-Document Frequencies, AFOD(j,d),
So That the Resulting Values n_j'(k) are Monotone Decreasing to the
Right of the Peak.


of a set of words ordered in one or more strings or contained in a
number of overlapping sets. A co-occurrence statistic is simply a
function of the joint occurrences of two or more words in the input.
An example of a co-occurrence statistic would be the number of times
word i and word j occur at distance d from one another in a string of
text. The second step in statistical word association is the
computation of a measure of association between two sets of words,
based on co-occurrence statistics involving the words in these sets.
The measure of association between the set containing word i and the
set containing word j, for example, might be a combination of a number
of co-occurrence statistics for the pair i,j, divided by some
normalization factor, based on the number of individual occurrences of
words i and j. We are also interested in second order associations,
since synonyms are expected to have high association of this type. If
B is the matrix of second order associations and A is the first order
matrix, then B = A^2.

The formula we used for measure of association is one investigated
by Paul Jones and Robert Curtice (1967):

$$A(i,j) = \frac{\mathrm{JFSC}(i,j)}{\mathrm{AFSC}(i)^{p}\,\mathrm{AFSC}(j)^{q}}$$

where A(i,j) is the measure of association of words i and j,
JFSC(i,j) is the Joint Frequency of i and j on a Sentence basis in the
Collection, i.e., the number of sentences in which both i and j occur,
AFSC(i) is the number of sentences in which i appears, as defined
earlier, and p + q = 1 (typical values are p = q = .5).
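
A minimal sketch of this sentence-based association measure follows. The function names and the four toy sentences are hypothetical; sentences are represented as lists of word tokens, and a word is counted once per sentence no matter how often it occurs there.

```python
from collections import Counter
from itertools import combinations


def sentence_frequencies(sentences):
    """Return (AFSC, JFSC): per-word and per-pair sentence frequencies.
    Pairs are stored with the two words in alphabetical order."""
    afsc = Counter()
    jfsc = Counter()
    for sentence in sentences:
        types = sorted(set(sentence))           # each type counted once per sentence
        afsc.update(types)
        jfsc.update(combinations(types, 2))
    return afsc, jfsc


def association(i, j, afsc, jfsc, p=0.5, q=0.5):
    """A(i,j) = JFSC(i,j) / (AFSC(i)**p * AFSC(j)**q), with p + q = 1."""
    pair = (i, j) if i < j else (j, i)
    joint = jfsc.get(pair, 0)
    if joint == 0:
        return 0.0
    return joint / (afsc[i] ** p * afsc[j] ** q)


# Hypothetical toy collection of four sentences.
sentences = [
    ["compiler", "translates", "program"],
    ["compiler", "program"],
    ["program", "runs"],
    ["river", "runs"],
]
afsc, jfsc = sentence_frequencies(sentences)
a_cp = association("compiler", "program", afsc, jfsc)
```

With p = q = .5 the measure is symmetric and equals the joint sentence frequency divided by the geometric mean of the two individual sentence frequencies; here "compiler" and "program" co-occur in 2 sentences and appear in 2 and 3 sentences respectively.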
3.4  Implementation of Statistical Computations

The computation of single word distribution statistics and word
association statistics in the current experiment was based on a
concordance of the input text. The concordance contained an entry for
each occurrence of each word, giving the word type and the
"coordinates" of that occurrence, namely, document number, paragraph
number within document, sentence number within paragraph, and word
number within sentence. The entries were arranged alphabetically by
word type and within a single type in order of occurrence. The context
accompanying each occurrence was not explicitly stored.
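
The concordance layout described above can be illustrated with a short sketch. This is not the original program: the `Entry` record, the nested-list input format, the lower-casing of tokens, and the toy documents are assumptions made for the example. Each entry holds the word type and its four coordinates, and sorting the entries yields alphabetical order by type and order of occurrence within a type.

```python
from collections import namedtuple

# One concordance entry: word type plus the coordinates of one occurrence.
Entry = namedtuple("Entry", "word document paragraph sentence position")


def build_concordance(documents):
    """documents is a list of documents; each document is a list of paragraphs,
    each paragraph a list of sentences, each sentence a list of word tokens.
    Coordinates are numbered from 1. Lower-casing is an assumption."""
    entries = []
    for d, document in enumerate(documents, start=1):
        for p, paragraph in enumerate(document, start=1):
            for s, sentence in enumerate(paragraph, start=1):
                for w, token in enumerate(sentence, start=1):
                    entries.append(Entry(token.lower(), d, p, s, w))
    # Tuple ordering sorts by word type first, then by coordinates.
    entries.sort()
    return entries


def afod(concordance, word, document):
    """Within-document frequency AFOD(word, document) read off the concordance."""
    return sum(1 for e in concordance if e.word == word and e.document == document)


# Two hypothetical documents, each with a single paragraph.
documents = [
    [[["The", "compiler", "runs"], ["the", "program", "halts"]]],
    [[["A", "compiler", "compiles"]]],
]
concordance = build_concordance(documents)
compiler_coordinates = [e[1:] for e in concordance if e.word == "compiler"]
```

Statistics such as AFOD, or the sentences shared by a pair of words, can then be derived from the entries without rescanning the original text.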

There are several advantages in using a concordance as a base for
generating statistical data on words. One advantage is the flexibility
which results from having data stored in an intermediate form. One
concordance can be used for computing a variety of single word
statistics or measures of association between word pairs in less time
than the rescanning of the original texts would require. Hence the
concordance is an excellent base for experimental studies.

In addition, a concordance can be useful when humans are editing
computer-produced statistical data, since it serves as an elementary
information retrieval system. Sentences in which a highly associated
pair of words co-occur, for example, can easily be located and
displayed.

The concordance has a further advantage relevant to computation
of association measures. One of the problems in working with word
associations is the quantity of data that can be generated. In
particular, for N word types, there are about $N^2/2$ word pairs whose
association measure could be computed, assuming a symmetric measure. In the
current experiment there were about 6,000 types, and hence about
18,000,000 potential association measures. The great majority of these
would have been zero, but even if only two per cent were non-zero, the
storage required for them would be quite large. The concordance
approach permits some selectivity in this situation. Since the measure
of association between two words can be computed fairly easily and
directly from the concordance, it is not necessary to compute at one
time all the association measures that will ever be needed. With a
concordance, the strategy of computing the associations between
perhaps several hundred words and saving these along with the concordance
is possible. Figure 3-2 shows the storage requirements for different
approaches to word association.
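
The storage comparison can be worked through numerically. The sketch below evaluates the expressions given in the explanation of Figure 3-2, using round figures from this experiment (about 6,000 types, 69,497 tokens) as inputs. The function names are hypothetical, and applying the full-text token count to the concordance term is an assumption made for illustration.

```python
def pair_count(n_types):
    """Approximate number of unordered word pairs, N^2 / 2."""
    return n_types ** 2 // 2


def storage_all_pairs(n_types, nonzero_fraction=0.02):
    # nonzero_fraction * N^2 non-zero associations; a symmetric measure
    # halves the count, and two machine words per stored element doubles
    # it again, leaving nonzero_fraction * N^2 machine words.
    return round(nonzero_fraction * n_types ** 2)


def storage_concordance(n_tokens, words_per_token=2):
    return words_per_token * n_tokens


def storage_selective(n_tokens, n_selected):
    # Concordance plus all associations among n_selected words.
    return storage_concordance(n_tokens) + n_selected ** 2


print(pair_count(6000))               # 18000000 potential measures
print(storage_all_pairs(6000))        # 720000 machine words
print(storage_selective(69497, 200))  # 178994 machine words
print(storage_selective(69497, 400))  # 298994 machine words
```

Under these assumptions the concordance plus associations among a few hundred selected words needs well under half the storage of the full set of non-zero associations.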

Figure 3-3 is a block diagram of the computations performed in
the current experiment. The input text was a set of 217 reviews of
documents in the computer programming field from 1962, 1964, and 1966
issues of the A.C.M. publication Computing Reviews. The text contained
69,497 word tokens and 6,405 word types, of which 2,920 occurred only
once. The text was keypunched and the punched cards were input to a
scanning program which isolated the individual word tokens and
associated the appropriate positional information (coordinates) with each.
The scanning program utilized a set of text analysis routines written
for the IBM 7094 by Ian C. Ross of Bell Telephone Laboratories, Murray
Hill, New Jersey. Following the scanning, an alphabetic sort by word
type produced the concordance. (Actually, because of the quantities
of data which could be sorted efficiently at one time, a concordance
for 1962 and 1964 was generated separately from 1966, and the two
concordances were then merged.) From the concordance a count was made of
Explanation:

T is a rough estimate of the number of word types for a text of
moderate homogeneity; assuming each type is stored in one computer
word, T is the storage required for word types.

$.02T^2$ is a conservative estimate of the number of non-zero associations
between all word types. If the association measure is symmetric,
there will only be half this many elements to store; if each element
requires two computer words, one for an association measure and one
to identify the two types, then $.02T^2$ is the storage space required.

C is the storage required for a concordance of the input text, assuming
2 computer words are used per token.

$C + 200^2$ and $C + 400^2$ are the storage requirements for a concordance
plus all associations between 200 and 400 words respectively.


Figure 3-2. A Comparison of Storage Requirements for
Different Approaches to Statistical Word Association.
Figure 3-3. Block Diagram for Statistical
Processing Performed in the
Current Experiment.


the frequency (number of tokens) for each word type which, after
sorting, resulted in a frequency-ordered list of word types.

Two procedures were used to reduce the amount of data to be
processed by the statistical programs: deletion of common words and
deletion of low frequency words. The exclusion of common words which
were on a list obtained automatically was intended as a substitute for
deletion of function words on a list generated manually. The sample
of "general literature" which served as our source of common words was
about four million words of Time Magazine text from 1963 and 1964. We
obtained magnetic tapes containing this text and a dictionary for it
with frequencies from Luther Haibt of IBM and Margaret Fischer of Time.
(Their use of this data was described in the presentation of Haibt,
Fischer, 'letelhut, and Ogg (1967).) We sorted the dictionary by
frequency and chose to define as a common term any word which had more
than 1300 occurrences in the Time sample. There were 251 such words,
after geographical names had been omitted. We tagged seven of these as
control terms which were not to be deleted from the Computing Reviews
concordance, since we wanted to see how their statistics would compare
with the statistics of core terms. Of the remaining 244 terms on the
Time common term tape, 217 appeared in the Computing Reviews
concordance and were deleted. These terms accounted for 30,647 tokens (about
half of the original text). The seven common words which were not
excluded accounted for another 2,805 tokens. At the same time we
deleted very low frequency words from the concordance, according to the
rule that a word was dropped if it didn't appear at least six times in
the text and at least twice in some review. This procedure resulted
in the elimination of 5,204 word types and 11,446 tokens. Thus the
"reduced" concordance used for the statistical processing contained
984 word types and the coordinates for the corresponding 27,404 tokens.
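
The two reduction rules can be sketched as a filter over per-review token lists. This is a minimal illustration, not the original program: the function name, the argument names, and the decision to apply the frequency rule to the control terms as well are assumptions.

```python
from collections import Counter


def reduce_vocabulary(reviews, common_words, control_words,
                      min_total=6, min_in_one_review=2):
    """reviews: one list of word tokens per review (hypothetical format).
    Returns the set of word types kept in the reduced concordance."""
    total = Counter()  # occurrences in the whole text
    peak = Counter()   # largest count within any single review
    for tokens in reviews:
        counts = Counter(tokens)
        total.update(counts)
        for word, n in counts.items():
            peak[word] = max(peak[word], n)
    # Common words are deleted unless tagged as control terms.
    deleted_common = set(common_words) - set(control_words)
    return {
        word for word in total
        if word not in deleted_common
        and total[word] >= min_total        # at least six times in the text
        and peak[word] >= min_in_one_review  # at least twice in some review
    }
```

The reported counts are consistent with this reading: 6,405 - 217 - 5,204 = 984 word types, and 69,497 - 30,647 - 11,446 = 27,404 tokens.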

The first step in obtaining single word distribution statistics
from the reduced concordance was a tabulation of within-document
frequencies for each of the 984 terms. These frequency counts were stored
on tape so that they would be available later if we should want to
compute new types of single word distribution measures. The measures
computed in the current experiment were the ones defined earlier in
this section. The results were stored on tape, and the ordering of
the 984 terms according to their distribution measure values was
obtained for each measure by sorting. The orderings were printed out
and evaluated as described in the following section.

The single word distribution measures were also used in the
generation of three sets of particular terms used in the compilation of
statistical association data for classification tables to be used in
man-machine dialogue. This process is described in more detail in the
following section. The other application of statistical association
data which was tested in the current experiment was the expansion
process using word inference, which was explained in the previous section.
For both of these applications association measures were computed
directly from the concordance. To find the number of sentence
co-occurrences of two words, the association program scanned in parallel
fashion through the concordance entries for the two words looking for
a match between the first three coordinates of occurrences of the two
words, i.e., document, paragraph, and sentence numbers. When a match
was found, the co-occurrence counter was incremented by one. (Multiple
appearances in the same sentence were counted only once.) The measure
of association was obtained as the total co-occurrence count divided
by functions of the individual frequencies, according to the formula
presented earlier.
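
The parallel scan can be sketched as a merge over two sorted coordinate lists. The function names are hypothetical. The exact association formula is given earlier in the report and is not reproduced here; the form $JFSC(i,j) / (AFSC(i)^p \cdot AFSC(j)^q)$ used below is an assumption chosen only to be consistent with "co-occurrence count divided by functions of the individual frequencies" and with p + q = 1.

```python
def sentence_keys(entries):
    """Distinct (doc, para, sent) coordinates for one word type.
    entries: (doc, para, sent, pos) tuples in concordance order."""
    keys = []
    for doc, para, sent, _pos in entries:
        key = (doc, para, sent)
        if not keys or keys[-1] != key:  # repeats in one sentence count once
            keys.append(key)
    return keys


def joint_sentence_frequency(entries_i, entries_j):
    """Merge-style parallel scan over the two words' entries."""
    a, b = sentence_keys(entries_i), sentence_keys(entries_j)
    x = y = count = 0
    while x < len(a) and y < len(b):
        if a[x] == b[y]:      # same document, paragraph and sentence
            count += 1
            x += 1
            y += 1
        elif a[x] < b[y]:
            x += 1
        else:
            y += 1
    return count


def association(entries_i, entries_j, p=0.5, q=0.5):
    # ASSUMED form of the measure; see the lead-in above.
    joint = joint_sentence_frequency(entries_i, entries_j)
    f_i = len(sentence_keys(entries_i))
    f_j = len(sentence_keys(entries_j))
    return joint / (f_i ** p * f_j ** q)
```

Because each word's entries are already in order of occurrence, a single pass over both lists suffices and no pair table has to be held in storage.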

The evaluation of the various outputs of the statistical
computation programs is discussed in the following section.


4.  RESULTS 
4.1  Introduction 

The two goals in the present experiment were: (1) a comparative
evaluation of a number of measures of the distribution of single words
among documents, and (2) an evaluation of two applications of
statistical association measures for word pairs.¹ The first application of
statistical association measures was the process (described earlier)
of expanding a small set of specialty words into a larger, more
comprehensive vocabulary for a subfield, by successive additions of words
closely associated with words already in the set. The other
application of statistical association measures was their use in adaptive
man-machine interaction in the process of formulating an information
request. This subject is treated in more detail by John S. Edwards
(1967).

There are two characteristics of single words which we would
like to be able to determine statistically from their distribution
among documents: (1) whether a word is a specialty term or not, and
(2) if a word is a specialty word, whether it is a general term or a
specific one within the vocabulary of the specialty. We can use the
first characteristic (specialty versus non-specialty) in forming a vocabulary
for a subject area. The second characteristic can be used in generating
a classification or hierarchy for the words in a vocabulary, or, more
generally, in forming classification tables for the vocabulary. In

1. We did not attack the problem of comparing different functions as
measures of the association between two words. This problem has been
treated mathematically by Vincent Giuliano (1965) and J. L. Kuhns (1965),
and an enlightening experimental comparison of different measures has
been reported by Paul Jones and Robert Curtice (1967).
addition, generality is an important parameter affecting the choice of
statistical association measures, according to Jones and Curtice (1967).
Hence, we evaluated each distribution measure for its ability to make
the two types of separations. It was originally our idea (Rubinoff
and Stone (1967)) to use single word distribution data first to
segregate specific specialty words from general specialty words and
non-specialty words, including common words, and then to separate the
general specialty words from the non-specialty words. As the
following discussion of results will make clear, we found a better process,
which begins by deleting common words and then uses one distribution
measure to separate specialty words (both general and specific) from
the remaining non-specialty words, and finally uses another
distribution measure to rank the specialty words according to their degree of
generality.

The single word distribution measures defined in the previous
section were computed for the 984 words of the Computing Reviews text
which remained after most of the common and very low frequency words
were omitted. The measures were evaluated in terms of their ordering
of the 984 words, rather than in terms of their absolute values for
these words. Jones and Curtice (1967) are advocates of this type of
evaluation, pointing out that the absolute magnitude of a measure for
one word is not meaningful in isolation; its relation to the values
of the measure for other words is more crucial.

4.2  Specialty  -  Non-Specialty  Discrimination 

The measures were first evaluated for their ability to
discriminate between specialty words and non-specialty words. Six sets
of words were chosen for this purpose: a low-frequency non-specialty
set (consisting primarily of function words), a low-frequency specialty
set, mid-frequency non-specialty and specialty sets, and high-frequency
non-specialty and specialty sets. For each frequency range,
non-specialty and specialty words were matched for frequency, so that there
would be no bias due to frequency. The choice of non-specialty or
function words, though subjective, was in the spirit of Miller, Newman,
and Friedman (1958). The specialty words were picked rather
arbitrarily. Two criteria were used to judge the separating power of the
measures. One was the separation of the average rank order for
non-specialty words from the average for specialty words, while the other
was the overlap of the intervals containing the non-specialty rank
orders and the specialty rank orders. (Negative overlap corresponds
to separation.) We are thus making the simplifying assumption that if
a measure is a good discriminator between non-specialty and specialty
words, it will tend to assign values in one interval to non-specialty
words and in another interval to specialty words. The alternative of
more than two intervals (alternating intervals of non-specialty and
specialty words) was considered improbable.
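
The two criteria can be illustrated with a short Python sketch. The function names are hypothetical, and the precise way the report computed interval overlap is not stated; the version below takes the width of the intersection of the two rank-order ranges, so that a negative value is the width of the gap between them.

```python
def average_separation(ranks_a, ranks_b):
    """Distance between the average rank orders of two word sets."""
    mean_a = sum(ranks_a) / len(ranks_a)
    mean_b = sum(ranks_b) / len(ranks_b)
    return abs(mean_a - mean_b)


def interval_overlap(ranks_a, ranks_b):
    """Width of the intersection of the two [min, max] rank intervals.
    A negative result means the intervals are separated by that gap."""
    return min(max(ranks_a), max(ranks_b)) - max(min(ranks_a), min(ranks_b))


# Made-up rank orders, for illustration only.
non_specialty = [700, 820, 910]
specialty = [40, 150, 300]
print(average_separation(non_specialty, specialty))  # about 646.7
print(interval_overlap(non_specialty, specialty))    # -400, i.e. separated
```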

Table 4-1 lists the words in each set and Figure 4-1 is a
detailed comparison of the ordering induced by one of the measures on
the members of a specialty set and a non-specialty set. Each ray
terminates on the vertical line at the rank order of one of the
members of the set, and the rays converge on the average rank order for
the set. Table 4-2 summarizes the rank order data for each set for
several of the distribution measures. It is evident that the two gamma
(coefficient of skewness) measures illustrated and the Poisson measure

Table 4-1. Word Sets Used in the Specialty - Non-Specialty
Discrimination Evaluation of Distribution Measures.

Low Frequency Non-Specialty Words     Low Frequency Specialty Words

Surely         Considerably           Algorithmic     Compiling
Mention        Following              Online          Disk
Instead        Latter                 Monitor         I/O
Therefore      Perhaps                Construction    Model
Somewhat       Provided               Sorting         Problem-oriented
Based          Reasonable             Matrix          Processor
Purpose        Whether                Variables       Tape
Value                                 Syntax

Mid Frequency Non-Specialty Words     Mid Frequency Specialty Words

Possible                              Operators
Using                                 Compilers
Thus                                  Procedure
Seems                                 Manual
Describes                             Time-sharing
Several                               Functions
Very                                  Method
However                               Notation

High Frequency Non-Specialty Words*   High Frequency Specialty Words

If                                    Cobol
Also                                  Language
An                                    List
Or                                    Program

*These were four out of the seven control common terms.

Note: The words in the low frequency sets had a total
frequency in the collection between 6 and 44, i.e.,
6 ≤ AFOC(j) ≤ 44. For the mid frequency sets,
45 ≤ AFOC(j) ≤ 90, and for the high frequency sets,
100 ≤ AFOC(j) ≤ 520.
[Figure 4-1: rank orders (1 = high variance) of the mid-frequency
specialty words and non-specialty words of Table 4-1.]

Figure 4-1. A Sample of the Ranking Induced by
the Poisson-Normalized Variance, VPLFOD(·).



Table 4-2a. Rank Order Data for the Sets of Non-Specialty and Specialty Terms of Table 4-1
for Different Distribution Statistics.

Note: Negative overlap corresponds to separation of rank order ranges.


S(·) have fairly similar behavior. For all three of these measures
the separation of average rank order of non-specialty from specialty
words is under 200 for the high frequency range, and for the variance
measure VLFOD(·) the separation is 26.3. The reason for these low
separations is probably the lack of independence between these four
measures and frequency (AFOC(·)). In particular, the gamma measures
and the Poisson S(·) tend to give a high-frequency word a high rank
order independent of whether it is a non-specialty word or a specialty
word, whereas the variance measure VLFOD(·) tends to have a large
value and hence assign a low rank order to such a word. The
Poisson-normalized variance measure VPLFOD(·) is the only measure with no
overlap between the high-frequency non-specialty and specialty rank
order intervals, and it separates the average rank order of
non-specialty and specialty words by 400.8.

The Poisson-normalized variance measure also performs better
than any of the other tested measures in the mid-frequency range and
the low frequency range. (Note that this measure is the one analogous
to the measure that Sally Dennis (1967) found to be the best
discriminator.) In the low frequency region all of the measures have a fair
amount of overlap between non-specialty and specialty rank order
intervals, suggesting that the values of the measures are somewhat erratic
when based on a small number of occurrences. However, the separation
between the non-specialty and specialty average rank orders for this
region shows that the distribution of values for specialty words is
well displaced from the distribution for non-specialty words. Except
for the standard variance VLFOD(·), the measures perform better in the
mid-frequency range than in the low or high frequency ranges.


4.3 Generality

The other property of single word distribution measures which
we investigated was the correlation between these measures and the
specificity or generality of the words. This correlation was
evaluated in a manner similar to the evaluation of the specialty -
non-specialty discrimination of the measures. Five sets of words were
used, each of them partitioned into two subsets, a group of relatively
general words and a group of relatively specific words. The terms in
the more general subset were closely related and about equally general
members of a subject area category, while each term in the more
specific subset was related to one or more of the general terms. In the
first four sets the relation was that of specific term to generic term,
i.e., the set of things named by the specific term was a subset of the
set of things named by the generic term. In the fifth set each term
in the more specific set was related to one or more terms in the more
general set either by being a part or component of it or by being
specific to it. The members of the sets were determined subjectively and
are listed in Table 4-3. Six measures were evaluated for their ability
to separate the rank orders of the words in the relatively general set
from those of the words in the corresponding relatively specific set.
The result for one measure and one pair of subsets is displayed in
Figure 4-2. The rank order data for all six measures and all five
sets are exhibited in Tables 4-4a, 4-4b, and 4-4c. It is evident that
generality determination is a more difficult statistical task than
specialty - non-specialty discrimination. The separation of average
rank order of general from specific is not so great as the separation
of specialty from non-specialty, in general. However, due to the
Table 4-3. Word Sets Used in Evaluating Distribution Measures as
Measures of Generality.

Generic Terms

Specific Terms
Algol
Cobol
Context-free
CPL
Fortran
IPL-V
Jovial
Lisp
List-processing
Machine-independent
Machine-oriented
Madcap
Metalanguage
PL/I
Problem-oriented
Trac
Xpop

Generic Terms
Processor
Processors
Software
System
Systems

Specific Terms
Assembler
Assemblers
Assembly
Compilation
Compiler
Compilers
Compiling
Executive
Interpreter
Macrogenerator
Monitor
Supervisor
Translation
Translator

Generic Terms
Devices
Equipment
Hardware
Unit
Units

Specific Terms
Console
Consoles
Core
CPU
Disk
Printers
Remote
Tape
Tapes
Terminals
Typewriter

Memory
Structure
Structures
Disk
Files
Format

General Terms
Program
Programs
Routine
Routines
Subprograms
Subroutine
Subroutines

Instructions
Loop
Macros
Operation
Operations

Figure 4-2. A Sample of the Ranking Induced by the
Poisson-Based Generality Measure, S(·).


Table 4-4a. Rank Order Data for the Sets of General and Specific Specialty Terms of Table 4-3.


Table 4-4b. Separation of Average Rank Order for General Subset from Average for Specific Subset
for the Five Groups of Terms of Table 4-3.

amount of overlap between the general and specific rank orders, the
separation of average rank orders is not so useful a statistic. What
is needed is a measure of degree of overlap, one that is more
informative than the width of the overlap interval. We have chosen the
following measure for this purpose:

$$D = \frac{N_O(G)}{N_T(G)} \cdot \frac{N_O(P)}{N_T(P)},$$

where G is the general set,

P is the specific set,

$N_O(G)$ is the number of general terms in the overlap
interval,

$N_T(G)$ is the total number of terms in the general
set, and

$N_O(P)$ and $N_T(P)$ are defined similarly.

A lower value of this measure corresponds to less objectionable overlap.
The fraction $N_O(G)/N_T(G)$ is the proportion of general terms which would be
misclassified if the boundary attempting to segregate general and
specific terms were placed so as not to misclassify any specific terms,
and conversely for $N_O(P)/N_T(P)$. The reason for taking the product of the
two ratios is this: for a given number of terms in the overlap
interval, $N_O(G) + N_O(P)$, a greater difference between $N_O(G)$ and $N_O(P)$
corresponds in general to a situation in which fewer total
misclassifications will be made. Suppose $N_O(P)$ is large while $N_O(G)$ is just 1 or
2. Then if the boundary is placed so as to misclassify none of the
specific terms, only 1 or 2 general terms will be misclassified. The
product $N_O(G) \cdot N_O(P)$ is smaller when the difference between $N_O(G)$ and
$N_O(P)$ is greater, assuming the sum is the same. Hence the measure

$$D = \frac{N_O(G) \cdot N_O(P)}{N_T(G) \cdot N_T(P)}$$

has a lower value in this situation.
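
The overlap measure D can be computed directly from the two lists of rank orders. The sketch below is illustrative: the function name is hypothetical, and treating the overlap interval as the intersection of the two rank ranges with inclusive endpoints is an assumption.

```python
def overlap_measure(general_ranks, specific_ranks):
    """D = [N_O(G) / N_T(G)] * [N_O(P) / N_T(P)]."""
    # Overlap interval: intersection of the two rank-order ranges.
    lo = max(min(general_ranks), min(specific_ranks))
    hi = min(max(general_ranks), max(specific_ranks))
    # If lo > hi the ranges are disjoint and both counts are zero.
    n_o_g = sum(lo <= r <= hi for r in general_ranks)
    n_o_p = sum(lo <= r <= hi for r in specific_ranks)
    return (n_o_g / len(general_ranks)) * (n_o_p / len(specific_ranks))


# Made-up rank orders, for illustration only.
general = [10, 20, 30, 60]
specific = [50, 55, 70, 80, 90]
print(overlap_measure(general, specific))  # (1/4) * (2/5), about 0.1
```

In this example the overlap interval is [50, 60]; it holds one of the four general terms and two of the five specific terms, so a boundary placed just above 60 would misclassify only the single general term.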


The last row in Table 4-4c is the sum for each measure of the
values of D for the five sets of words tested. The measure with the
lowest (best) sum is the Poisson measure S(·), and AFDC(·) and AFOC(·)
were almost as good as S(·) in measuring generality. It is interesting
to note that the performance of the coefficient of skewness CLFOD(·)
was not nearly as good as that of S(·), whereas in the specialty -
non-specialty discrimination, they were more nearly comparable. The three
measures with the best overall general-specific discrimination ability
all had more difficulty with Group 5 than any other group, and this
group was the one that had the relation whole-part along with
generic-specific.


The most successful general-specific discriminator, S(·), was
defined as the size of the hypothesized document collection for which
the observed values of within-document frequency would most nearly fit
a Poisson distribution. The motivation for this definition was the
thought that a very specific (particular) word might have roughly a
Poisson distribution within a small subset of the document collection,
namely that portion which covers the subfield in which the specific
word denotes a concept. The word probably will not have been used in
all the documents in the subset, though one can imagine that it was in
the authors' "pool of available words" while writing these documents
and was used some of the time while a synonym was used at other times.
For documents outside the subset, this specific word was for all
practical purposes not in the "pool of available words". Our experimental
results encourage us to make the following working definition: the
generality of a content word with respect to a given subject area
expressed by a set of documents is the proportion of documents in this
set to which the word is relevant. The ratio S(·)/N can be considered
an attempt to estimate generality.

4.4 Summary of Distribution Measure Evaluation

In summary, our conclusion is that among the measures we tested,
there is no single measure which will order the words roughly in the
following way: non-specialty words, general specialty words, specific
specialty words. Our finding has been that two different measures are
needed, one for separating non-specialty from specialty words, and
another for separating general from specific words among the specialty
words. It is not too surprising that general-specific discrimination
should be more difficult than non-specialty - specialty. The problem
of homographs probably accounts for much of the difficulty. Homographs
would not cause much trouble for non-specialty - specialty
discrimination, because most homographs would be judged specialty words for all
of their meanings. Homographs like "will", which can be either a
non-specialty (function) word (an auxiliary verb) or a specialty word (a
legal document), are quite rare. However, it is likely that the
different meanings of a specialty word homograph are on different levels
of generality.

The final evaluation of single word distribution measures
involved their ranking of a set of 52 terms of special importance for
the field of computer programming. These terms were obtained from
three sources: terms used in manual indexing of the documents (reviews)
which were the input data for the statistical processing, terms defined
in the IFIP-ICC Vocabulary of Information Processing (1966), and terms
used in the subject indices of textbooks on computer programming.
Table 4-5 enumerates these words. The evaluation data are
illustrated in Table 4-6. Data on the seven control common terms are
also included in these tables. Other than the variance VLFOD(·), which
ranks the common words lower than the index words, the measure VPLFOD(·)
gives the lowest (best) average rank order for the index terms, no
doubt because both general and specific technical terms are used in
indexing, and have a high value (low rank order) of VPLFOD(·).

4.5  Evaluation  of  Word  Association  Applications 

Two uses of statistical word association were investigated in
our current experiment, and there is a different type of evaluation
appropriate for each. First, measures of association were used in the
word-inference process of expanding a small "kernel" vocabulary into a
larger, more comprehensive vocabulary for a specialty area, and, second,
they were used directly in interactive (man-machine) retrieval request
formulation.

4.6 The Expansion Process Using Word Association

To test the expansion process described in Section 2, we chose
a set of mid-frequency specialty terms concentrated in the subfield of
computer software. The set consisted of 14 terms, all of which had a
frequency between 25 and 75 in the collection. The expansion process
began with this set of 14 as the "kernel" and the rest of the 984 terms
in the reduced concordance (except for the seven control common terms)
as the non-kernel set. After four cycles of the expansion process, a
total of 96 terms were added to the original kernel. Table 4-7 shows
which terms were added by each iteration and gives the frequency for

Table 4-5. Actual Index Terms and Control Common Terms.

Group 1. Terms (with freq. < 100 in Computing Reviews text) used in
manual indexing of Computing Reviews or extracted from subject
indices of textbooks.

Addressing          Instructions
Algorithm           Manipulation
Allocation          Matrix
Assembler           Non-numerical
Automatic           On-line
Coding              Parallel
Construction        Push-down
Control             Recursive
Design              Routine
Error               Simulation
Execution           Storage
Files               Symbol
Format              Syntax
Implementation      Time-sharing
Index               Translation
Input

Group 2. Terms (with freq. < 100 in Computing Reviews text) from
IFIP-ICC Vocabulary and not in Group 1.

Assembly
Autocode
Code
Declaration
Identifiers
Instruction
Interpreter
Interpretive
Operation
Procedure
Problem-oriented
Sets
Translator

Group 3. Terms from any of the three above sources which had
freq. ≥ 100 in Computing Reviews text.

Algol
Cobol
Compiler
Language
List
Machine
Program
System

Group 4. Control Common Terms from Time Magazine (freq. in Computing
Reviews text is indicated in parentheses).

Also      (100)
An        (495)
And       (1726)
If        (107)
Or        (270)
Though    (21)
Very      (77)



Table 4-6. Average Rank Order for Actual Index Terms
and Control Common Terms of Table 4-5.

Term Set               CRFOD  [illegible]  CAFOD    S      TLFOD  VLFOD  VPLFOD

Group 1 (Index)        515.7    476.4      484.4   524.4   263.4  183.0  208.1
Group 2 (IFIP)         497.2    503.7      516.9   504.0   378.1  391.8  368.8
Group 3 (Core)         850.5    858.5      859.8   806.1    12.6   10.0   79.4
Groups 1, 2, & 3       562.5    543.2      550.3   562.7   253.5  208.6  228.5
Group 4 - Control
Common Terms    [illegible]21.9 904.0      882.6   907.9   119.0   85.9  433.3



Table 4-7a. Stages in the Expansion of a "Kernel" Set of
Specialty Terms by Means of Statistical Word Association


Original Kernel -
14 Terms in Software Area
for Which 25 ≤ AFOC ≤ 75.

Term                 AFOC          VPLFOD Order

Allocation             26               54
Assembly               41              241
Coding            [illegible]           93
Execution              31              280
Jovial                 46                3
Lisp                   57                1
Lists                  38              207
List-processing        26               51
Routine                46               18
Software               27              324
Subroutines            32              164
Translation            58              116
Translator             49               80

Average:               40.4            140.0


Iteration 1 -
30 Terms Added Because
Sum ≥ .40 or Sum of Sq. ≥ .06

Term                 AFOC          VPLFOD Order

5                      19              139
Addresses              18              103
Algol                 221               38
Arithmetic             52              357
Assemblers              7              273
Author                210              403
Automatic              58              227
Compiler              124              113
Computer              277              155
Described              99              572
Dynamic                13              294
Fortran               118               57
Hardware               30              252
IBM                    51              173
Input                  80               42
Intermediate           30              131
Internal               25               49
IPL-V                  16                7
Language              519              102
Languages             158              122
List                  109               34
Machine               135              140
Paper                 299              347
Program               256              156
Programming           288              133
Register                7               37
Storage                64               75
Symbolic               30              296
Use                   181              503
Used                  169              709

Average:              122.1            201.3
Table 4-7b. Stages in the Expansion of a "Kernel" Set of
Specialty Terms by Means of Statistical Word Association


Iteration 2 -
8 Terms Added Because
Sum ≥ 1.70 or Sum of Sq. ≥ .17

Term                 AFOC          VPLFOD Order

1                      34              243
60                     67              114
Describes              62              971
Index                  18               32
Manual                 57               44
Output                 45              232
Structures             42              240
System                314               41

Average:               79.8            239.6


Iteration 3 -
12 Terms Added Because
Sum ≥ 1.45 or Sum of Sq. ≥ .145

Term                 AFOC          VPLFOD Order

Arrays                 25              448
Basic                  71              419
Data                  148               50
Description           111              308
Functions              65              188
Level                  45              135
Processing        [illegible]          201
Programs              144              228
Source                 51              341
Systems               109              333
Using                  51              897
Written                83              636

Average:               84.0            348.6


Table 4-7c. Final Stage in the Expansion of a "Kernel" Set of
Specialty Terms by Means of Statistical Word Association


Iteration 4 -
46 Terms Added Because
Sum ≥ 1.45 or Sum of Sq. ≥ .125

Term                 AFOC          VPLFOD Order

630                     7               81
Addressing             21              146
Although               57              807
Article                83              282
Automatically           7              313
Available              58              371
Book                  103               39
Code                   48              110
Computers              72              515
Control                73              361
Describe               34              798
Design                 62              264
Developed              41              874
Elements               27              686
Example                81              483
Form                   68              719
Formal                 41              312
Given                  97              476
However                85              976
Implementation         42              428
Implemented            20              914
Information            96              354
Introduction           47              449
Memory                 48              236
Notation               85               36
Number                110              336
Object                 35              190
Operations             43              656
Performance            12              311
Permits                26              840
Present                51              889
Problem               135           21[illegible]
Problems               87              487
Procedures             53              345
Produce                20              873
Programmer        [illegible]          259
Recent            [illegible]          279
Recursive              32              397
Report                 61               66
Required               59              744
Several                 4              945
Simultaneous            1              453
Structure              51              314
Symbol                 36              378
Techniques         5[illegible]        525
Variables              37              386

Average:          53.[illegible]       454.6

each, as well as the rank order induced by the specialty/non-specialty
discrimination measure VPLFOD(·).

The thresholds for sum of associations and sum of squares of
associations were chosen before each cycle, accounting for the variation
in number of terms added each time. An improvement would be to
compute the sum and sum of squares of associations for all terms first
and then set the thresholds (either manually or automatically), taking
into account how many terms would be added by the values chosen.

One property of the expansion process in this example is the
tendency for added terms to be in the same subfield as the original
kernel. Only a few of the hardware or application terms from the
vocabulary were added. Some non-specialty words passed the tests and
got in, but most of these had a high rank order for the distribution
measure VPLFOD(·), and could be screened out automatically on this
basis.
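
The expansion cycle described above can be sketched in a few lines of Python. This is only an illustrative reading of the procedure, not the program used in the experiment: the function names, the use of "greater than or equal to" for both tests, and the toy association values are all assumptions introduced here. A candidate term is added when either the sum or the sum of squares of its associations with the current kernel reaches its threshold, and a separate threshold pair is supplied for each cycle.

```python
def passing_terms(assoc, kernel, candidates, sum_threshold, sumsq_threshold):
    """Return the candidates that pass either test against the current kernel."""
    added = []
    for term in sorted(candidates):
        # Association of the candidate with every kernel term (0 if unrecorded).
        values = [assoc.get(frozenset((term, k)), 0.0) for k in kernel]
        total = sum(values)
        total_sq = sum(v * v for v in values)
        if total >= sum_threshold or total_sq >= sumsq_threshold:
            added.append(term)
    return added


def expand_kernel(assoc, kernel, vocabulary, thresholds):
    """Run one cycle per (sum, sum-of-squares) threshold pair."""
    kernel = set(kernel)
    history = []
    for sum_threshold, sumsq_threshold in thresholds:
        candidates = set(vocabulary) - kernel
        added = passing_terms(assoc, kernel, candidates,
                              sum_threshold, sumsq_threshold)
        history.append(added)
        kernel.update(added)
    return kernel, history


# Made-up association values, for illustration only (not data from the report).
toy_assoc = {
    frozenset(("allocation", "storage")): 0.5,
    frozenset(("allocation", "register")): 0.3,
    frozenset(("register", "index")): 0.4,
    frozenset(("allocation", "index")): 0.05,
}
toy_vocabulary = {"allocation", "storage", "register", "index", "paper"}
toy_kernel, toy_history = expand_kernel(
    toy_assoc, {"allocation"}, toy_vocabulary, [(0.4, 0.06), (0.4, 0.1)]
)
```

In this toy run "register" enters in the first cycle through the sum of squares test alone, and "index" enters in the second cycle only because "register" is by then in the kernel, which mirrors the kind of indirect contribution discussed in the text.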

It is instructive to look at specific terms and see the pattern
of associations which brought them into the kernel. The term "60" was
added in iteration 2, for example, largely because of its strong
association with "Algol", which was added in iteration 1. Another
example is illustrated in Figure 4-3, where lines joining words
represent those association measures greater than .04 which resulted in
the addition of the word on the right to the kernel. This figure is a
portion of what Lauren Doyle (1961) would call an association map. The
word "Register" was added in iteration 1 primarily because of its
associations with "Allocation", "Compilers", and "Lists". Due to the
large association with "Allocation", it passed the sum of squares test.
The term "Index" was added by the sum of squares criterion in iteration
2, largely due to its association with "Register". Note that it had
both a direct association with "Allocation" and an indirect association
through "Register". The problem of index register allocation was, not
surprisingly, the central topic in one of the documents in the
collection. It is interesting that, primarily because of the large
association between "Allocation" and "Storage", the latter term passed
both the sum and sum of squares thresholds in iteration 1. Most of the
terms added in iteration 1 passed only the sum threshold.

Figure 4-3. Partial Association Map for Some of the Terms
Involved in the Expansion Process Based on Statistical
Associations.

Further examples are given in tabular form in Table 4-8. Each
box enumerates, for a given term, the terms (with an association measure
of at least .04 with the given term) which contributed directly to the
acceptance of the given term into the kernel. The group of boxes taken
together, however, indicates some of the indirect contributions. The
term "Language", for example, which came into the kernel in iteration
1, contributed to the incorporation of the terms "Data" and "Processing"
in iteration 3, both indirectly through "Structures" and directly.
"Structures" was a sum of squares addition, "Data" was a sum addition,
and "Processing" passed both threshold tests. "Structure", "Symbol",
and "Information" were all added in the fourth iteration by the sum
test.

From this sample of the workings of the expansion process, it
appears that both the sum of associations and the sum of squares of
associations are useful. It is also clear that interesting words are
being added as late as the fourth iteration, and that these tend to
have more or stronger direct connections with words added to the kernel
than with words originally in the kernel. Further research on the
expansion process is obviously needed, but our preliminary explorations
have made


Table 4-8a. Associations (≥ .04) Contributing to
the Sum of Squares (.190) Which Caused "Structures"
to be Added to the Kernel during Iteration 2.
(The Sum of Associations was 1.27.)

Original Kernel Term      Association Meas. with "Structures"
    Lisp                      .13
    Lists                     .08

First Iteration Term      Association Meas. with "Structures"
    Algol                     .04
    List                      .36
    Dynamic                   .04
    Language                  .06
    Languages                 .04
    Machine                   .07
    Paper                     .05
    Programming               .04


Table 4-8b. Associations (≥ .04) Contributing to
the Sum of Associations (1.47) Which Caused "Data"
to be Added to the Kernel during Iteration 3.
(The Sum of Squares was .080.)

Original Kernel Term      Association Meas. with "Data"
    Allocation                .05
    Execution                 .05
    Routine                   .08

First Iteration Term      Association Meas. with "Data"
    Automatic                 .05
    Compiler                  .05
    Fortran                   .05
    IBM                       .05
    Input                     .08
    Language                  .04
    Machine                   .04
    Paper                     .04
    Program                   .04
    Programming               .06
    Storage                   .05
    Used                      .03

Second Iteration Term     Association Meas. with "Data"
    Structures                .11
    System                    .07

Table 4-8c. Associations (≥ .04) Contributing to
the Sum (1.92) and Sum of Squares (.155) of Associations
Which Caused "Processing" to be Added to
the Kernel during Iteration 3.

Original Kernel Term      Association Meas. with "Processing"
    Lisp                      .07

First Iteration Term      Association Meas. with "Processing"
    3                         .04
    Algol                     .04
    Author                    .04
    Automatic                 .05
    Computer                  .05
    Described                 .04
    Internal                  .06
    IPL-V                     .07
    Language                  .08
    Languages                 .05
    List                      .22
    List-processing           .04
    Machine                   .05
    Paper                     .05
    Programming               .06
    Programs                  .06

Second Iteration Term     Association Meas. with "Processing"
    Manual                    .09
    Structures                .12
    System                    .12

Table 4-8d. Associations (≥ .04) Contributing to
the Sum of Associations (1.77) Which Caused "Structure"
to be Added to the Kernel during Iteration 4.
(The Sum of Squares was .105.)

Original Kernel Term      Association Meas. with "Structure"
    Allocation                .0[illegible]
    Execution                 .05
    Lists                     .09

First Iteration Term      Association Meas. with "Structure"
    Algol                     .05
    Arithmetic                .06
    Dynamic                   .04
    Language                  .0[illegible]
    List                      .11
    Program                   .04
    Programming               .04
    Use                       .04

Second Iteration Term     Association Meas. with "Structure"
    Describes                 .05

Third Iteration Term      Association Meas. with "Structure"
    Arrays                    .08
    Basic                     .09
    Data                      .10
    Description               .06
    Source                    .04


Table 4-8e. Associations (≥ .04) Contributing to
the Sum of Associations (1.54) Which Caused "Symbol"
to be Added to the Kernel during Iteration 4.
(The Sum of Squares was .077.)

Original Kernel Term      Association Meas. with "Symbol"
    Lists                     .05

First Iteration Term      Association Meas. with "Symbol"
    5                         .04
    Dynamic                   .04
    Input                     .05
    Language                  .06
    List                      .09
    Machine                   .04
    Programming               .04
    System                    .05
    Use                       .05

Second Iteration Term     Association Meas. with "Symbol"
    Structures                .0[illegible]

Third Iteration Term      Association Meas. with "Symbol"
    Basic                     .04
    Data                      .05
    Processing                .09


Table 4-8f. Associations (≥ .04) Contributing to
the Sum of Associations (1.[illegible]2) Which Caused
"Information" to be Added to the Kernel during
Iteration 4. (The Sum of Squares was .105.)

Original Kernel Term      Association Meas. with "Information"
    Software                  .04

First Iteration Term      Association Meas. with "Information"
    Computer                  .05
    IBM                       .04
    Input                     .07
    Language                  .04
    Machine                   .05
    Paper                     .04
    Program                   .06
    Storage                   .04
    Use                       .06
    Used                      .07

Second Iteration Term     Association Meas. with "Information"
    Index                     .05
    Manual                    .07
    Output                    .10
    System                    .05

Third Iteration Term      Association Meas. with "Information"
    Basic                     .05
    Data                      .06
    Description               .06
    Processing                .12
    Programs                  .05
    Source                    .04
    Systems                   .04

us optimistic about the usefulness of the basic idea.

4.7  Word  Associations  in  Man-Machine  Interaction 

The use of statistical word association in retrieval request
formulation is described in some detail in Adaptive Man-Machine
Interaction in Information Retrieval by John S. Edwards (1967). In
general, there are two contrasting methods of employing word association
data in the retrieval process. One method uses the data in an automatic
expansion of a retrieval request to include terms highly associated
with the original ones, while the other presents the highly associated
terms to the user and lets him modify his request. Our experimentation
was centered on a variation of this latter process in which the
algorithm for presentation of associated terms was an adaptive
algorithm. A simple adaptive algorithm might require that the retrieval
vocabulary be classified and might adapt by suggesting more terms in
categories containing terms previously accepted by the user. Our
approach was a variation of this, in which word pairs were classified
rather than single words. For example, the pair processor-channel
could be classified as a hardware relation, while processor-compiler
could be classified as a software relation. Each word pair stored in
the system is associated with a category as well as a weight (derived
from the measure of statistical association). When a request is
presented to the system, it begins searching the set of word pairs
(stored in the form of lists) looking for words which are highly
associated with one or more of the words in the original request. The
criterion for suggesting word j to the user is that for some i in C,

    \sum_{k \in K} G_k(n) \, w_k(i,j) > T,

where K is the set of word pair categories (relations), C is the current
request set, G_k(n) is the "gain" of category or relation k after n
iterations, w_k(i,j) is the weight of the pair (i,j) under relation k,
and T is the threshold. If a word is suggested to the user, then it
must be the case that this word is closely associated with one or more
of the words in the current index set through one or more pair
categories. If the user accepts this suggested word, the gains of the
appropriate pair categories are increased, and if the user rejects the
word, the gains are decreased. The number of cycles of suggestion and
acceptance or rejection of terms is under control of the user.
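
A minimal Python sketch of this suggestion and adaptation cycle follows. It assumes a great deal that the text leaves open: the data layout (a dictionary of pair weights per relation), the additive gain step, and the rule that the "appropriate" categories are those linking the judged word to some request term are all hypothetical choices made here for illustration, as are the toy weights.

```python
def suggest(request, weights, gains, threshold):
    """Suggest each word j whose gain-weighted association with some request
    word i, summed over relations k, exceeds the threshold."""
    request = set(request)
    candidates = {j for pairs in weights.values() for (_, j) in pairs} - request
    suggested = []
    for j in sorted(candidates):
        for i in request:
            score = sum(gains[k] * pairs.get((i, j), 0.0)
                        for k, pairs in weights.items())
            if score > threshold:
                suggested.append(j)
                break
    return suggested


def update_gains(gains, weights, request, word, accepted, step=0.25):
    """Raise (on acceptance) or lower (on rejection) the gain of every
    relation that links the judged word to a request term.
    The additive step is an assumption, not the report's update rule."""
    new_gains = dict(gains)
    for k, pairs in weights.items():
        if any((i, word) in pairs for i in request):
            change = step if accepted else -step
            new_gains[k] = max(0.0, new_gains[k] + change)
    return new_gains


# Toy weights, invented for illustration only.
toy_weights = {
    "hardware": {("system", "printers"): 0.5, ("system", "register"): 0.3},
    "software": {("system", "monitor"): 0.5, ("system", "debugging"): 0.3},
}
toy_gains = {"hardware": 1.0, "software": 1.0}

first_round = suggest(["system"], toy_weights, toy_gains, 0.4)
toy_gains = update_gains(toy_gains, toy_weights, ["system"], "monitor", True, step=0.5)
toy_gains = update_gains(toy_gains, toy_weights, ["system"], "printers", False, step=0.5)
second_round = suggest(["system", "monitor"], toy_weights, toy_gains, 0.4)
```

After the user accepts a software term and rejects a hardware term, the second round of suggestions shifts toward the software relation, which is the adaptive behavior the text describes.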

The word pair data used in our experiment came from the
Computing Reviews text. Three sets of words were chosen, partly on the
basis of single word distribution measures, from the 984 words on the
reduced concordance tape for this text. One set consisted of relatively
specific terms in the area of computer hardware, another of software
terms, and another of applications terms. The association between each
term in the first set and all other terms was computed. All
associations greater than a certain threshold were kept and were stored
on magnetic tape with the category name "Relation 1" and a weight
derived from the measure of association. The second set was used in a
similar way to generate Relation 2 pairs, and the third set, Relation 3.
The formula for the association measure between word i in one of the
three sets and any word j outside that set was

    A(i,j) = \frac{JFSC(i,j)}{AFSC(i)^{2/3} \cdot AFSC(j)^{1/3}}


Associations were also computed between pairs of words in the same set,
using the formula

    A(i,j) = \frac{JFSC(i,j)}{AFSC(i)^{1/2} \cdot AFSC(j)^{1/2}}


The exponents 1/3 and 2/3 in the first formula were chosen so as to
increase the probability of relating more general terms to the
relatively specific terms in each set, in accordance with the results
reported by Jones and Curtice (1967). It should be noted that a general
term could be (and frequently was) the left member of pairs in
more than one category. For example, if channel were in the first set
(the hardware set), and compiler were in the second (the software set),
then the pair processor-channel would be classified as Relation 1,
while the pair processor-compiler would be Relation 2. Thus our method
of generating relation data for this particular experiment was able to
handle multiple meanings or multiple viewpoints at the level of core
terms. Homographs on the level of particular terms are much rarer and
caused no trouble in the present experiment.
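
The asymmetric measure and the generation of relation pairs can be illustrated with a short Python sketch. It assumes that the joint and single-word frequencies are available as plain counts, that the specific term in the chosen set carries the exponent 2/3 and the outside (more general) term the exponent 1/3, and that the general term is stored as the left member of the pair. The function names, the data layout, and the toy counts are hypothetical.

```python
def association(joint, freq_i, freq_j, exp_i=2 / 3, exp_j=1 / 3):
    """Joint frequency divided by the single-word frequencies raised to
    the given exponents (assumed: 2/3 for the set term, 1/3 for the other)."""
    if freq_i <= 0 or freq_j <= 0:
        raise ValueError("single-word frequencies must be positive")
    return joint / (freq_i ** exp_i * freq_j ** exp_j)


def relation_pairs(joint_counts, freqs, specific_set, threshold):
    """Keep pairs (general term, specific term) whose association between a
    term i in the chosen set and a term j outside it exceeds the threshold."""
    pairs = {}
    for (i, j), joint in joint_counts.items():
        if i in specific_set and j not in specific_set:
            weight = association(joint, freqs[i], freqs[j])
            if weight > threshold:
                pairs[(j, i)] = weight
    return pairs


# Invented counts, for illustration only.
toy_freqs = {"channel": 8, "processor": 64}
toy_joint = {("channel", "processor"): 8}
toy_relation_1 = relation_pairs(toy_joint, toy_freqs, {"channel"}, 0.1)
```

With unequal exponents, a frequent general word such as "processor" is penalized less for its high frequency than the specific word it is paired with, so general words are more likely to clear the threshold; passing 1/2 for both exponents gives a symmetric variant for pairs within one set.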

The results of our experiment in interactive request formulation
were quite satisfactory, and can be illustrated by the data in
Table 4-9. Two dialogues are summarized in this table. In both,
the same set of terms was used in the original request (first column),
and thus the first set of computer-generated candidates for addition
to the request set (column 2) is the same in both cases. However,
the next set of suggested additions (fourth column) reflects the
area of terms which the user decided to accept in column 3,
demonstrating the adaptive aspect of the system.


Table 4-9. Sample Results of Man-Machine
Interaction in Request Formulation.


INTERACTIVE REQUEST FORMULATION
WITH SOFTWARE REINFORCEMENT

ORIGINAL         SUGGESTED        ACCEPTED         NEW
REQUEST TERMS    ADDITIONS        ADDITIONS        SUGGESTIONS

OPERATING        ATLAS            COMMAND          TURNAROUND
SYSTEM           MODULES          MONITOR          ONLINE
                 PRINTERS         TIME-SHARING     DEBUGGING
                 TRANSMISSION                      INTERRUPT
                 TURNAROUND                        SUPERVISOR
                 ONLINE                            REAL-TIME
                 COMMAND                           BATCH
                 MONITOR                           EXECUTIVE
                 TIME-SHARING


INTERACTIVE REQUEST FORMULATION
WITH HARDWARE REINFORCEMENT

ORIGINAL         SUGGESTED        ACCEPTED         NEW
REQUEST TERMS    ADDITIONS        ADDITIONS        SUGGESTIONS

OPERATING        ATLAS            MODULES          ATLAS
SYSTEM           MODULES          TRANSMISSION     PRINTERS
                 PRINTERS                          TURNAROUND
                 TRANSMISSION                      ONLINE
                 TURNAROUND                        PARAMETERS
                 ONLINE                            I/O
                 COMMAND                           DEVICES
                 MONITOR                           REGISTER
                 TIME-SHARING                      SIMULATION

5. RECOMMENDATIONS
5.1  Summary 

The experimental results presented in the previous section are
quite encouraging, and reinforce our belief that statistical procedures
can accomplish much of the work involved in establishing an indexing
vocabulary for use in an information system and generating relations
among the terms in this vocabulary. In particular, it appears that
measures of the distribution of words among documents can be used to
separate specialty (technical) terms for a subject area from the
non-specialty (non-technical) terms, the best measure among the ones
tested being the variance of the within-document frequencies of a word
divided by the total number of occurrences of the word. This process
can be augmented by a word inference process based on statistical
associations between word pairs which will expand a set of specialty
terms into a larger interrelated set of terms. Furthermore, it may be
possible to use single word distribution measures to estimate the degree
of generality or specificity of a technical term. Finally,
classification tables derived from statistical association measures have
been shown to be quite useful in man-machine formulation of retrieval
requests.
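
As a rough illustration of the single-word measure singled out above, the following Python sketch computes the variance of a word's within-document frequencies divided by its total number of occurrences. It is a sketch under stated assumptions rather than the report's definition: the function name is hypothetical, the input is taken to be one count per document (zeros included), and the population form of the variance is an arbitrary choice made here.

```python
def variance_per_occurrence(doc_freqs):
    """Variance of within-document frequencies divided by total occurrences.

    doc_freqs holds one count per document in the collection, including
    zeros for documents in which the word does not occur.
    """
    n_docs = len(doc_freqs)
    total = sum(doc_freqs)
    if n_docs == 0 or total == 0:
        raise ValueError("need at least one document and one occurrence")
    mean = total / n_docs
    # Population variance (dividing by the number of documents) is assumed.
    variance = sum((f - mean) ** 2 for f in doc_freqs) / n_docs
    return variance / total


evenly_spread = variance_per_occurrence([1, 1, 1, 1])   # same count everywhere
concentrated = variance_per_occurrence([4, 0, 0, 0])    # all in one document
```

Two words with the same total frequency receive different values: the word spread evenly over the documents scores zero, while the word concentrated in a single document scores higher, which is the kind of contrast a distribution measure needs in order to separate clustered from evenly used words.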

The experimental results presented in this report are valuable
mainly as pointers; they indicate promising directions for future
research, and are thus, we hope, indirect contributions to the very
practical problem of getting the appropriate information to the people
who need it. The continuation of the research reported on here must
involve an interplay between empirical investigation and theoretical
investigation, with model-building leading to new experiments which in
turn modify the theoretical model.
5.2 Proposals for Future Research

One of the empirical questions which needs to be explored more
fully is the question of the form of the within-document frequency
distributions of words. Is it near to a Poisson distribution for
function or other non-specialty words of different total frequencies?
How does it vary for specialty words? Also in the empirical domain is
the testing of a number of new single word distribution measures. There
is a class of measures of which only one member was investigated in the
current experiment. This class consists of functions of the set of
within-document frequencies which treat the non-zero values of
within-document frequency differently from the zero values. The
function S(·) did this in the present investigation. There are a great
many other functions in this class which we can imagine. One might
discard a constant fraction of the zero values and compute standard
distribution measures (e.g., variance) for the remaining zero and
non-zero values. Or, the fraction of zero values discarded might be
variable, a function of some other statistic like total frequency.
There are also a number of other measures or combinations of measures
which could be investigated, such as the third moment divided by total
frequency, or the coefficient of skewness divided by its expectation for
a Poisson distribution, to name two at random.
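
One member of the proposed class, discarding a constant fraction of the zero values before computing the variance, might look like the following Python sketch. The function name, the rounding down of the number of zeros dropped, and the use of the population variance are assumptions made here for illustration; the report does not specify them.

```python
def variance_discarding_zeros(doc_freqs, zero_fraction_dropped):
    """Variance of within-document frequencies after dropping a fixed
    fraction of the zero counts (the number dropped is rounded down)."""
    if not 0.0 <= zero_fraction_dropped <= 1.0:
        raise ValueError("fraction must lie between 0 and 1")
    nonzero = [f for f in doc_freqs if f != 0]
    n_zeros = len(doc_freqs) - len(nonzero)
    kept_zeros = n_zeros - int(n_zeros * zero_fraction_dropped)
    values = nonzero + [0] * kept_zeros
    if not values:
        raise ValueError("no values left after discarding zeros")
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


toy_counts = [3, 1, 0, 0, 0, 0]          # invented within-document counts
half_dropped = variance_discarding_zeros(toy_counts, 0.5)
all_dropped = variance_discarding_zeros(toy_counts, 1.0)
```

Making the fraction depend on another statistic, such as total frequency, would only require computing `zero_fraction_dropped` from the counts before the call.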

There are at least two approaches to semi-automatic generation
of a technical vocabulary which do not rely on the distribution of
within-document frequencies of single words, and these should receive
attention. The first of these is exemplified by the work of Curtice
and Jones (1967) mentioned early in Section 2. The basic idea is to
attempt to determine whether a word is a specialty term by measuring
the variation of the contexts in which it occurs, a specialty term
presumably occurring in a more restricted environment than a
non-specialty term. Such techniques might often be based on measures of
the statistical association of word pairs. The other approach to
vocabulary generation makes use of statistical data on words as used
outside the subject area of interest. The procedure could be the
following: take a set of documents covering the given subject area and
treat them as a single document to be indexed in the manner of Edmundson
and Wyllys (1961) and Damerau (1965). In other words, look for terms
whose relative frequency in the subject area collection is significantly
higher than their relative frequency in the language as a whole. Such
words could be considered as indexing the field as a whole and forming
its technical vocabulary.
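
The second approach can be sketched in Python as a comparison of relative frequencies. This is a deliberately simplified illustration: a fixed ratio cutoff stands in for a proper test of "significantly higher", words unseen in the general sample are given a floor count of one, and the function name, parameter names, and toy counts are all hypothetical.

```python
def field_vocabulary(field_counts, general_counts, min_ratio=5.0):
    """Return words whose relative frequency in the subject collection is at
    least min_ratio times their relative frequency in general text."""
    field_total = sum(field_counts.values())
    general_total = sum(general_counts.values())
    if field_total == 0 or general_total == 0:
        raise ValueError("both frequency tables must be non-empty")
    selected = []
    for word, count in sorted(field_counts.items()):
        rel_field = count / field_total
        # Floor of one occurrence avoids division by zero for unseen words.
        rel_general = max(general_counts.get(word, 0), 1) / general_total
        if rel_field / rel_general >= min_ratio:
            selected.append(word)
    return selected


# Invented counts, for illustration only.
toy_field = {"compiler": 50, "the": 500, "also": 10}
toy_general = {"compiler": 1, "the": 6000, "also": 100, "house": 3899}
toy_vocabulary_terms = field_vocabulary(toy_field, toy_general)
```

In the toy data only "compiler" is selected, since the function words occur at roughly the same relative rate inside and outside the subject collection.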

There are also a number of empirical questions to be answered
concerning the word-association-based expansion process. One question
concerns the most desirable amount by which to expand the kernel in
each iteration. Clearly, adding a very large number of terms and
adding a very few terms are both undesirable. Experimentation is also
needed with the incorporation of feedback in the process, either from
single word statistics or human editors, in order to reduce the number
of non-specialty words which get added.

Experimentation with a larger data set is in order now and is
in progress. Even with a larger data set, however, our work is likely
for some time to be insight-oriented rather than proof-oriented, in the
terminology of Giuliano and Jones (1966). Working with a data set
encompassing a broader subject area will enable us to see how our
techniques are related to the scope of the document collection.
In the theoretical model-building area, there are several
problems to be worked on. It would be desirable to find a modification
in the current Poisson-based model which would eliminate the need for
the assumption of equal document lengths. This might be accomplished by
using relative rather than absolute frequencies. Further investigation
of the estimator p of the Poisson parameter m, as defined in Section 3,
is also required. As experimental work continues and better techniques
for accomplishing different tasks are discovered, the underlying
explanatory model of the statistical phenomena in language usage should
be revised to take into account the new knowledge. For example, if
the measures VPLFOD(·) and S(·) are confirmed in further testing as
good measures for separating specialty from non-specialty terms and
general from specific terms, respectively, then greater effort will be
needed in relating the hypotheses about language usage underlying the
two measures. It may happen that some very successful empirical approach
has no obvious interpretation in terms of language usage, but the search
for such an interpretation should be pursued tenaciously because of the
likelihood of its suggesting further experimentation or additional
theoretical investigations.

There are several areas of research closely related to the work
reported in this document, and some of the results obtained here are
therefore likely to have implications for these areas. One of these
areas is automatic indexing and another is automatic classification.
The automatic detection of homographs and synonyms is also a related
area. Conversely, there are ideas currently being explored in these
areas which are relevant to the type of investigations with which we
have been concerned.
In conclusion, the goal of our work is not the automation
of as much as possible, but rather, as proposed by Lauren Doyle
(1965), the discovery of the optimal allocation of tasks between
man and computer and the most productive forms of man-machine
interaction. The research discussed here has contributed to this
goal by showing that there are a number of important and challenging
problems in information retrieval with which statistical techniques
can deal.